Statistics and Sampling Concepts
On this page
Introduction
SingleStore automatically gathers statistics to give the optimizer the information it needs.
It’s almost always best to leave autostats enabled.ANALYZE command.
The ANALYZE command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled.
Autostat Types
SingleStore automatically collects and updates column statistics on columnstore tables, incrementally in the background.ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.
Autostats provides automatic maintenance of column statistics.INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.
While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE.
Running ANALYZE also saves a copy of the table’s current automatic statistics.ANALYZE instead.
SingleStore uses three types of autostats:
|
Column Statistics Methods |
Histogram Methods |
Sampling Methods | |
|---|---|---|---|
|
Columnstore |
Incremental or Periodic |
Create or Update |
On or Off |
|
Rowstore |
Periodic only |
Create or Update |
On only |
Column Statistics
|
Memory Usage |
Best Use Case |
Can Disable | |
|---|---|---|---|
|
Columnstore |
Incremental: 7kb per column per partition Periodic: ~150 bytes per column per node |
|
Yes |
|
Rowstore |
~150 bytes per column per node |
|
Yes |
Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE command.
You can view the column statistics last collected by querying the information_ table or DESCRIBE INFORMATION_
+-----------------------+--------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| ROW_COUNT | bigint(21) | YES | | NULL | |
| NULL_COUNT | bigint(21) | YES | | NULL | |
| CARDINALITY | bigint(21) | YES | | NULL | |
| AUTOSTATS_ENABLED | tinyint(1) | YES | | NULL | |
| AUTOSTATS_OUTDATED | tinyint(1) | YES | | NULL | |
| AUTOSTATS_ROW_COUNT | bigint(21) | YES | | NULL | |
| AUTOSTATS_NULL_COUNT | bigint(21) | YES | | NULL | |
| AUTOSTATS_CARDINALITY | bigint(21) | YES | | NULL | |
| ADVANCED_HISTOGRAMS | bigint(21) | YES | | NULL | |
| LEGACY_HISTOGRAMS | bigint(21) | YES | | NULL | |
| RANGE_STATS | bigint(21) | YES | | NULL | |
| SAMPLE_SIZE | bigint(21) | YES | | NULL | |
| LAST_UPDATED | datetime | YES | | NULL | |
+-----------------------+--------------+------+------+---------+-------+AUTOSTATS_ indicates if cardinality autostats are enabled.LAST_ column shows the time non-incremental statistics were saved.ANALYZE runs, or on periodic autostats collection.
Note
information_ is not accurate for columnstore tables using incremental autostats.ANALYZE.
The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question.
The RANGE_ column indicates whether any kind of histogram is stored for this column.LEGACY_ and ADVANCED_ columns indicate presence of legacy and new histograms, respectively.cardinality_ set to 6.RANGE_.
The cardinality_ engine variable controls the behavior of histograms and sampling used for selectivity estimation, which is an important part of query optimization.
-
6.- Use legacy histograms.0 -
6.- Use advanced histograms.5 -
7.- Same behavior as0 6..5 -
7.- Use histograms for join estimation.1 -
7.- Use sampling and histograms together (when both are available), to improve selectivity estimation.3 When query processing distributed joins, the optimizer will consider initiating joining from the leftmost derived table. Please note, that the behavior of the engine variable
verify_(which provides control over how join predicates are applied) is dependent on thefields_ in_ transitivity cardinality_setting.estimation_ level When the cardinality_is set at 7.estimation_ level 3, the behavior of the verify_variable is interpreted asfields_ in_ transitivity ON.When the cardinality_is set belowestimation_ level 7., the behavior of the3 verify_variable is interpreted asfields_ in_ transitivity OFF.
AUTOSTATS_, AUTOSTATS_, and AUTOSTATS_ are deprecated.
Columns in information_ will contain NULL values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE has not yet been run.
Method of Cardinality Autostat Collection
Cardinality autostats can be collected in two ways: incremental or periodic.
-
Incremental - Incremental autostats are supported only on columnstore tables.
The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables. -
Periodic - Periodic autostats are supported on both columnstore and rowstore tables.
Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStore engine automatically runs ANALYZEon the table, in the background.As ANALYZEcan be an expensive operation, total resource usage by periodic autostats is limited according to thesystem_variable.optimizer_ pool
Histogram or Range Statistics
|
Memory Usage |
Best Use Case |
Can Disable | |
|---|---|---|---|
|
Columnstore |
For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types |
Range filter that does not increase; e. |
Yes |
|
Rowstore |
For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types |
Range filter that does not increase; e. |
Yes |
The ANALYZE command can additionally collect histograms over designated columns.ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.
When collected, histograms can be queried through the information_ table or DESCRIBE INFORMATION_.
+-------------------------+---------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| TYPE | varchar(255) | NO | | NULL | |
| VERSION | bigint(21) | NO | | NULL | |
| BUCKET_COUNT | bigint(21) | NO | | NULL | |
| BUCKET_INDEX | bigint(21) | NO | | NULL | |
| RANGE_MIN | varchar(8192) | YES | | NULL | |
| RANGE_MAX | varchar(8192) | YES | | NULL | |
| UNIQUE_COUNT | double | YES | | NULL | |
| CARDINALITY | double | YES | | NULL | |
| UNIQUE_COUNT_CUMULATIVE | double | YES | | NULL | |
| CARDINALITY_CUMULATIVE | double | YES | | NULL | |
+-------------------------+---------------+------+------+---------+-------+The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket.
If you want to find a list of the columns that have histograms, you can use the following query:
SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAMEFROM information_schema.ADVANCED_HISTOGRAMS;
Method of Histogram Autostat Collection
Histogram autostats can be collected in two ways: create or update.
-
Create - With
AUTOSTATS_set toHISTOGRAM_ MODE CREATE, if a column is identified at least twice ininformation_that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time.schema. mv_ prospective_ histograms Like any histograms you create, it is updated when you run ANALYZE, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process). -
Update - With
AUTOSTATS_set toHISTOGRAM_ MODE UPDATE, the existing histograms are kept current, but new histograms are not created.
Sampling
|
Memory Usage |
Best Use Case |
Can Disable | |
|---|---|---|---|
|
Columnstore |
~0. |
Not an issue since SingleStore separately stores a small percentage of data from tables in a separate internal table called the sampling table. |
Yes |
|
Rowstore |
No extra cost |
Sampling can't be turned off. |
No |
Columnstore and rowstore tables both can be sampled by the query optimizer.
Rowstore tables are sampled directly.
Method of Sampling Autostat Collection
Sampling is either enabled or disabled for columnstore tables, via the default_ engine variable.
Histograms vs Sampling
If histograms are unavailable, the optimizer will use a sample to compute estimates.
Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.
Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query.ANALYZE TABLE.
Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing.
Also, since histograms are one-dimensional, they cannot capture correlation between columns.
Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions.
Plan Invalidation
When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics.ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.
Plans can be invalidated manually by running the ANALYZE command.ON.OFF to prevent unexpected recompilation of plans when data changes.
SingleStore uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.ANALYZE, all plans eligible for invalidation (i.
In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.ANALYZE, the plan will be invalidated).
Also, running SHOW PLANCACHE will display any existing plans.DROP <plan_ will invalidate that plan.DROP ALL FROM PLANCACHE that will drop all valid plans.
Last modified: July 2, 2025