Statistics and Sampling Concepts
Introduction
SingleStore automatically gathers statistics to give the optimizer the information it needs. The system that gathers statistics automatically is known as autostats. The term autostats is also used to describe statistics that are gathered automatically.
It’s almost always best to leave autostats enabled. But if for some reason you don’t want to use autostats, you can disable it. If you turn off autostats, you should use the ANALYZE command to gather statistics manually. Most users can rely on autostats and will never need to run the ANALYZE
command.
The ANALYZE
command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled.
Autostat Types
SingleStoreDB automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE
, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.
Autostats provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT
, UPDATE
, DELETE
, LOAD DATA
, and TRUNCATE
, as well as ALTER TABLE
.
While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE
.
Running ANALYZE
also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE
instead.
SingleStore uses three types of autostats:
Column Statistics Methods | Histogram Methods | Sampling Methods | |
---|---|---|---|
Columnstore | Incremental or Periodic | Create or Update | On or Off |
Rowstore | Periodic only | Create or Update | On only |
Column Statistics
Memory Usage | Best Use Case | Can Disable | |
---|---|---|---|
Columnstore | Incremental: 7kb per column per partition Periodic: ~150 bytes per column per node |
| Yes |
Rowstore | ~150 bytes per column per node |
| Yes |
Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE
command. These statistics include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.
You can view the column statistics last collected by querying the information_schema.OPTIMIZER_STATISTICS
table or DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS.
+-----------------------+--------------+------+------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------------+--------------+------+------+---------+-------+ | DATABASE_NAME | varchar(512) | NO | | NULL | | | TABLE_NAME | varchar(512) | NO | | NULL | | | COLUMN_NAME | varchar(512) | NO | | NULL | | | ROW_COUNT | bigint(21) | YES | | NULL | | | NULL_COUNT | bigint(21) | YES | | NULL | | | CARDINALITY | bigint(21) | YES | | NULL | | | AUTOSTATS_ENABLED | tinyint(1) | YES | | NULL | | | AUTOSTATS_OUTDATED | tinyint(1) | YES | | NULL | | | AUTOSTATS_ROW_COUNT | bigint(21) | YES | | NULL | | | AUTOSTATS_NULL_COUNT | bigint(21) | YES | | NULL | | | AUTOSTATS_CARDINALITY | bigint(21) | YES | | NULL | | | ADVANCED_HISTOGRAMS | bigint(21) | YES | | NULL | | | LEGACY_HISTOGRAMS | bigint(21) | YES | | NULL | | | RANGE_STATS | bigint(21) | YES | | NULL | | | SAMPLE_SIZE | bigint(21) | YES | | NULL | | | LAST_UPDATED | datetime | YES | | NULL | | +-----------------------+--------------+------+------+---------+-------+
AUTOSTATS_ENABLED
indicates if cardinality autostats are enabled. The LAST_UPDATED
column shows the time non-incremental statistics were saved. This occurs when ANALYZE
runs, or on periodic autostats collection.
Note
information_schema.OPTIMIZER_STATISTICS.LAST_UPDATED
is not accurate for columnstore tables using incremental autostats. This is because autostats are collected upon ingest and maintained per segment, meaning that they are not stored like other statistics collected by ANALYZE
.
The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question. If no error is surfaced, autostats are up to date. See Identifying Missing Statistics for more information.
The RANGE_STATS
column indicates whether any kind of histogram is stored for this column. Two types of histograms, advanced or legacy, may be present. The LEGACY_HISTOGRAMS
and ADVANCED_HISTOGRAMS
columns indicate presence of legacy and new histograms, respectively. If you are using legacy histograms (using cardinality_estimation_level set to 6.0), information about the histograms is shown in information_schema.RANGE_STATISTICS
. This table is depreciated as of v7.3 of SingleStoreDB, but will continue to show this information.
AUTOSTATS_ROW_COUNT
, AUTOSTATS_NULL_COUNT
, and AUTOSTATS_CARDINALITY
are deprecated. The columns remain in the table, but their value is always NULL.
Columns in information_schema.OPTIMIZER_STATISTICS
will contain NULL
values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE
has not yet been run.
Method of Cardinality Autostat Collection
Cardinality autostats can be collected in two ways: incremental or periodic. Incremental collection is the default.
Incremental - Incremental autostats are supported only on columnstore tables. The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables.
Periodic - Periodic autostats are supported on both columnstore and rowstore tables. Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStoreDB engine automatically runs
ANALYZE
on the table, in the background. AsANALYZE
can be an expensive operation, total resource usage by periodic autostats is limited according to thesystem_optimizer_pool
variable.
Histogram or Range Statistics
Memory Usage | Best Use Case | Can Disable | |
---|---|---|---|
Columnstore | For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types | Range filter that does not increase; e.g. timestamp or auto increment | Yes |
Rowstore | For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types | Range filter that does not increase; e.g. timestamp or auto increment | Yes |
The ANALYZE
command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE
will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.
When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS
table or DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS
.
+-------------------------+---------------+------+------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------------------+---------------+------+------+---------+-------+ | DATABASE_NAME | varchar(512) | NO | | NULL | | | TABLE_NAME | varchar(512) | NO | | NULL | | | COLUMN_NAME | varchar(512) | NO | | NULL | | | TYPE | varchar(255) | NO | | NULL | | | VERSION | bigint(21) | NO | | NULL | | | BUCKET_COUNT | bigint(21) | NO | | NULL | | | BUCKET_INDEX | bigint(21) | NO | | NULL | | | RANGE_MIN | varchar(8192) | YES | | NULL | | | RANGE_MAX | varchar(8192) | YES | | NULL | | | UNIQUE_COUNT | double | YES | | NULL | | | CARDINALITY | double | YES | | NULL | | | UNIQUE_COUNT_CUMULATIVE | double | YES | | NULL | | | CARDINALITY_CUMULATIVE | double | YES | | NULL | | +-------------------------+---------------+------+------+---------+-------+
The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.
If you want to find a list of the columns that have histograms, you can use the following query:
SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAME FROM information_schema.ADVANCED_HISTOGRAMS;
Method of Histogram Autostat Collection
Histogram autostats can be collected in two ways: create or update. Create collection is the default.
Create - With
AUTOSTATS_HISTOGRAM_MODE
set toCREATE
, if a column is identified at least twice ininformation_schema.mv_prospective_histograms
that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time. Like any histograms you create, it is updated when you runANALYZE
, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process).Update - With
AUTOSTATS_HISTOGRAM_MODE
set toUPDATE
, the existing histograms are kept current, but new histograms are not created.
Sampling
Memory Usage | Best Use Case | Can Disable | |
---|---|---|---|
Columnstore | ~0.1% more disk space needed for samples stored on disk plus ~2% more memory for columnstore metadata | Not an issue since SingleStore separately stores a small percentage of data from tables in a separate internal table called the sampling table. | Yes |
Rowstore | No extra cost | Sampling can't be turned off. Partitions have fairly equal distribution of data. | No |
Columnstore and rowstore tables both can be sampled by the query optimizer. Columnstore tables automatically have a random row-level sample built and maintained when autostats is enabled. This random sample is kept up to date in the background. The stored sample is sometimes called the columnstore sample. If a columnstore must be sampled during query optimization, this row-level sample is used.
Rowstore tables are sampled directly. An actual row-level random sample is collected by scanning the table structure and skipping a large fraction of the rows. This is an efficient process that takes time proportional to the sample size.
Method of Sampling Autostat Collection
Sampling is either enabled or disabled for columnstore tables. Enabled is the default. Rowstore tables only have the enabled option.
Histograms vs Sampling
If histograms are unavailable, the optimizer will use a sample to compute estimates. For rowstore tables, it will use dynamic sampling (i.e. sample the table data by scanning part of the table) to compute estimates. For columnstore tables, it will use the stored columnstore sample.
Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.
Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query. For example, histograms are most effective on columns where the overall distribution of data is not changing quickly (even if the data itself is changing quickly). When the data distribution is changing, if autostats is disabled, it is important to periodically update histograms to reflect changes in data by re-running ANALYZE TABLE
.
Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing. For example, a typical case where histograms are generally inaccurate and not recommended is on columns where the range of data is continuously growing, such as a timestamp or ID column where new, ever-increasing values are continuously being added. Histograms only reflect the range of data at the time the histograms were last updated, so new values outside that range are not reflected in the histograms. This leads to inaccurate estimates for queries that filter for this recent data. For these types of columns, leaving histograms disabled, which means sampling is used instead, typically leads to the best results.
Also, since histograms are one-dimensional, they cannot capture correlation between columns. Sampling can estimate predicates that feature a high degree of correlation between columns more accurately than histograms.
Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions. Histograms are typically more accurate on these predicates.
Plan Invalidation
When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE
triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.
Plans can be invalidated manually by running the ANALYZE
command. To automatically invalidate the plan as autostats update, set the enable_background_plan_invalidation variable to ON
. By default, this variable is set to OFF
to prevent unexpected recompilation of plans when data changes.
SingleStore uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE
, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.
In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE
, the plan will be invalidated).
Also, running SHOW PLANCACHE
will display any existing plans. If there are any plans you wish to invalidate, running DROP <plan_id> FROM PLANCACHE
will invalidate that plan. If you run DROP ALL FROM PLANCACHE
that will drop all valid plans. See DROP … FROM PLANCACHE for more details.