Statistics and Sampling Concepts
On this page
Introduction
SingleStore automatically gathers statistics to give the optimizer the information it needs.
It’s almost always best to leave autostats enabled.ANALYZE
command.
The ANALYZE
command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled.
Autostat Types
SingleStore automatically collects and updates column statistics on columnstore tables, incrementally in the background.ANALYZE
, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.
Autostats provides automatic maintenance of column statistics.INSERT
, UPDATE
, DELETE
, LOAD DATA
, and TRUNCATE
, as well as ALTER TABLE
.
While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE
.
Running ANALYZE
also saves a copy of the table’s current automatic statistics.ANALYZE
instead.
SingleStore uses three types of autostats:
Column Statistics Methods |
Histogram Methods |
Sampling Methods |
|
---|---|---|---|
Columnstore |
Incremental or Periodic |
Create or Update |
On or Off |
Rowstore |
Periodic only |
Create or Update |
On only |
Column Statistics
Memory Usage |
Best Use Case |
Can Disable |
|
---|---|---|---|
Columnstore |
Incremental: 7kb per column per partition Periodic: ~150 bytes per column per node |
|
Yes |
Rowstore |
~150 bytes per column per node |
|
Yes |
Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE
command.
You can view the column statistics last collected by querying the information_
table or DESCRIBE INFORMATION_
+-----------------------+--------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| ROW_COUNT | bigint(21) | YES | | NULL | |
| NULL_COUNT | bigint(21) | YES | | NULL | |
| CARDINALITY | bigint(21) | YES | | NULL | |
| AUTOSTATS_ENABLED | tinyint(1) | YES | | NULL | |
| AUTOSTATS_OUTDATED | tinyint(1) | YES | | NULL | |
| AUTOSTATS_ROW_COUNT | bigint(21) | YES | | NULL | |
| AUTOSTATS_NULL_COUNT | bigint(21) | YES | | NULL | |
| AUTOSTATS_CARDINALITY | bigint(21) | YES | | NULL | |
| ADVANCED_HISTOGRAMS | bigint(21) | YES | | NULL | |
| LEGACY_HISTOGRAMS | bigint(21) | YES | | NULL | |
| RANGE_STATS | bigint(21) | YES | | NULL | |
| SAMPLE_SIZE | bigint(21) | YES | | NULL | |
| LAST_UPDATED | datetime | YES | | NULL | |
+-----------------------+--------------+------+------+---------+-------+
AUTOSTATS_
indicates if cardinality autostats are enabled.LAST_
column shows the time non-incremental statistics were saved.ANALYZE
runs, or on periodic autostats collection.
Note
information_
is not accurate for columnstore tables using incremental autostats.ANALYZE
.
The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question.
The RANGE_
column indicates whether any kind of histogram is stored for this column.LEGACY_
and ADVANCED_
columns indicate presence of legacy and new histograms, respectively.cardinality_
set to 6.RANGE_
.
The cardinality_
engine variable controls the behavior of histograms and sampling used for selectivity estimation, which is an important part of query optimization.
-
6.
- Use legacy histograms.0 -
6.
- Use advanced histograms.5 -
7.
- Same behavior as0 6.
.5 -
7.
- Use histograms for join estimation.1 -
7.
- Use sampling and histograms together (when both are available), to improve selectivity estimation.3 When query processing distributed joins, the optimizer will consider initiating joining from the leftmost derived table. Please note, that the behavior of the engine variable
verify_
(which provides control over how join predicates are applied) is dependent on thefields_ in_ transitivity cardinality_
setting.estimation_ level When the cardinality_
is set at 7.estimation_ level 3, the behavior of the verify_
variable is interpreted asfields_ in_ transitivity ON
.When the cardinality_
is set belowestimation_ level 7.
, the behavior of the3 verify_
variable is interpreted asfields_ in_ transitivity OFF
.
AUTOSTATS_
, AUTOSTATS_
, and AUTOSTATS_
are deprecated.
Columns in information_
will contain NULL
values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE
has not yet been run.
Method of Cardinality Autostat Collection
Cardinality autostats can be collected in two ways: incremental or periodic.
-
Incremental - Incremental autostats are supported only on columnstore tables.
The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables. -
Periodic - Periodic autostats are supported on both columnstore and rowstore tables.
Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStore engine automatically runs ANALYZE
on the table, in the background.As ANALYZE
can be an expensive operation, total resource usage by periodic autostats is limited according to thesystem_
variable.optimizer_ pool
Histogram or Range Statistics
Memory Usage |
Best Use Case |
Can Disable |
|
---|---|---|---|
Columnstore |
For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types |
Range filter that does not increase; e. |
Yes |
Rowstore |
For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types |
Range filter that does not increase; e. |
Yes |
The ANALYZE
command can additionally collect histograms over designated columns.ANALYZE
will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.
When collected, histograms can be queried through the information_
table or DESCRIBE INFORMATION_
.
+-------------------------+---------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME | varchar(512) | NO | | NULL | |
| TABLE_NAME | varchar(512) | NO | | NULL | |
| COLUMN_NAME | varchar(512) | NO | | NULL | |
| TYPE | varchar(255) | NO | | NULL | |
| VERSION | bigint(21) | NO | | NULL | |
| BUCKET_COUNT | bigint(21) | NO | | NULL | |
| BUCKET_INDEX | bigint(21) | NO | | NULL | |
| RANGE_MIN | varchar(8192) | YES | | NULL | |
| RANGE_MAX | varchar(8192) | YES | | NULL | |
| UNIQUE_COUNT | double | YES | | NULL | |
| CARDINALITY | double | YES | | NULL | |
| UNIQUE_COUNT_CUMULATIVE | double | YES | | NULL | |
| CARDINALITY_CUMULATIVE | double | YES | | NULL | |
+-------------------------+---------------+------+------+---------+-------+
The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket.
If you want to find a list of the columns that have histograms, you can use the following query:
SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAMEFROM information_schema.ADVANCED_HISTOGRAMS;
Method of Histogram Autostat Collection
Histogram autostats can be collected in two ways: create or update.
-
Create - With
AUTOSTATS_
set toHISTOGRAM_ MODE CREATE
, if a column is identified at least twice ininformation_
that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time.schema. mv_ prospective_ histograms Like any histograms you create, it is updated when you run ANALYZE
, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process). -
Update - With
AUTOSTATS_
set toHISTOGRAM_ MODE UPDATE
, the existing histograms are kept current, but new histograms are not created.
Sampling
Memory Usage |
Best Use Case |
Can Disable |
|
---|---|---|---|
Columnstore |
~0. |
Not an issue since SingleStore separately stores a small percentage of data from tables in a separate internal table called the sampling table. |
Yes |
Rowstore |
No extra cost |
Sampling can't be turned off. |
No |
Columnstore and rowstore tables both can be sampled by the query optimizer.
Rowstore tables are sampled directly.
Method of Sampling Autostat Collection
Sampling is either enabled or disabled for columnstore tables, via the default_
engine variable.
For columnstore tables, the method of sampling is configured via the sampling_
engine variable, which has four possible modes:
-
auto
- This is the default mode, and it is the behavior that SingleStore has used historically.A separate sample table is maintained at each leaf node, and these are queried by the aggregator node. Due to the shape of the query (filter predicates are inside an aggregator function) used to determine the results, this can sometimes be expensive and cause significant load on the leaf nodes. -
aggregation
- This is the same behavior asauto
.(This behavior may change in future releases. ) -
lightweight
- This option only dispatches a query to evaluate the numerator (the number of rows that passes through the filter predicate), and uses a "good enough" value for the denominator (the total number of rows in the table) that has been previously computed.In this case, the filter predicate can be pushed down to the operator tree as it is not contained within a aggregator function, and this should result in a faster query. A potential downside is that this query does not compute the denominator. Since SingleStore uses an estimate for the denominator, the selectivity value computed from this option can be inaccurate and can result in suboptimal plans. Try this option if a significant performance issue is observed from the
auto
oraggregation
sampling query. -
filtered
- If thelightweight
option does not provide accurate selectivity values, try this option, which uses two table scans to evaluate both accurate numerator and denominator, and joins the results in one query.Similar to the lightweight
option, the filter predicate is with theWHERE
(not in an aggregator function) and should perform better in theory.The potential downside of this method is the overhead from the additional table scan, which is essentially running another query.
Histograms vs Sampling
If histograms are unavailable, the optimizer will use a sample to compute estimates.
Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.
Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query.ANALYZE TABLE
.
Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing.
Also, since histograms are one-dimensional, they cannot capture correlation between columns.
Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions.
Plan Invalidation
When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics.ANALYZE
triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.
Plans can be invalidated manually by running the ANALYZE
command.ON
.OFF
to prevent unexpected recompilation of plans when data changes.
SingleStore uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.ANALYZE
, all plans eligible for invalidation (i.
In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.ANALYZE
, the plan will be invalidated).
Also, running SHOW PLANCACHE
will display any existing plans.DROP <plan_
will invalidate that plan.DROP ALL FROM PLANCACHE
that will drop all valid plans.
Correlation Statistics
Correlated columns mean the value of one column depends on the value of another column to some degree.
Optimizing queries containing correlated columns can be challenging.
Consider a scenario where 30% of a company's customers reside in California, resulting in a 30% selectivity when filtering by state = "CA.
Optimizing the query plan can be achieved by using statistics that account for correlated column relationships and dependencies within a query predicate.
The optimizer defaults the correlation coefficient to 0.
Consider the tables and query:
CREATE TABLE inventory (id INT, model_year INT, make VARCHAR(10), model VARCHAR(20), SHARD KEY(id), SORT KEY(model));
CREATE TABLE sales (s_id INT PRIMARY KEY,i_id INT, sales_month FLOAT, monthly_sales FLOAT,yearly_sales AS (monthly_sales * 12) PERSISTED FLOAT, SORT KEY(sales_month));
SELECT inventory.model_year, inventory.make, inventory.model, sales.sales_month, sales.montly_salesFROM inventory JOIN sales ON inventor.id = sales.i_idWHERE inventory.model = 'Honda' AND inventory.make = 'Accord';
In this case, the optimizer should disregard the inventory.
filter in estimations as it is functionally determined by inventory.
and use the correlation statistics functionality to correlate inventory.
and inventory.
with a coefficient of 1.
To set the correlation coefficient, run the following command.
ANALYZE TABLE inventory CORRELATE COLUMN inventory.model WITH COLUMN inventory.make USING COEFFICIENT 1.0;
To view the correlation coefficient set between two columns on a table, run the information_
statement:
SELECT * FROM information_schema.correlated_column_statistics;
+---------------+------------+------------------------+-----------------------+-------------------|---------------------+
| DATABASE_NAME | TABLE_NAME | CORRELATED_COLUMN_NAME | CORRELLEE_COLUMN_NAME | CORRELATION_COEFF | CREATED_TIME |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+
| test1 | inventory | model | make | 1 | 2023-12-07 22:02:31 |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+
Last modified: August 6, 2024