Distinct Count Estimation Functions
On this page
Using Distinct Count Estimation Functions
Imagine that you work at an airline that is marketing air travel to families.
You run your calculation on the passenger_
table, which has the columns flight_
, flight_
, passenger_
and family_
.family_
.
To find the unique family total, you run the query SELECT COUNT(DISTINCT family_
.
The code below creates the passenger_
table and populates it with sample data.
DELIMITER //CREATE PROCEDURE populate_booking_table(tbl VARCHAR(40), number_of_records INT,_flight_date DATE, _family_id INT) ASBEGINFOR i IN 1..number_of_records LOOP/* This example doesn't populate the booking table with aflight number or a passenger ID; these values aren't needed todemonstrate the distinct count estimate. */INSERT INTO passenger_booking (flight_date, family_id)VALUES (_flight_date, _family_id);END LOOP;END//DELIMITER ;/* The following procedure creates the passenger_booking table and populatesit with sample values for four days. The procedure intentionally populatesfamily_id with duplicate values to demonstrate the distinct count estimatelater. */DELIMITER //CREATE PROCEDURE create_sample_data() ASBEGINCREATE TABLE passenger_booking (flight_number INT, flight_date DATE,passenger_id INT, family_id INT);FOR family_id IN 1..10000 LOOPCALL populate_booking_table('passenger_booking', 1,TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 5001..10000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 8001..18000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/02/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 15001..35000 LOOPCALL populate_booking_table('passenger_booking', 3,TO_DATE('01/03/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 30001..50000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/04/2018','MM/DD/YYYY'), family_id);END LOOP;END//DELIMITER ;CALL create_sample_data();
Find family_
’s distinct count estimate:
SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking;
+--------+
| result |
+--------+
| 50000 |
+--------+
Getting a Distinct Count Estimate (Method 1)
Calculating the distinct count of a large column can be time-consuming and memory-intensive.family_
’s distinct count:
SELECT APPROX_COUNT_DISTINCT(family_id) AS result FROM passenger_booking;
+--------+
| result |
+--------+
| 49933 |
+--------+
APPROX_
is implemented using the HyperLogLog algorithm.
As shown in the following table, family_
’s estimate calculation runs faster and consumes less memory, as compared to the exact calculation.
Query |
Execution Time (After query compilation) |
Memory Use (After query compilation) |
---|---|---|
|
|
|
|
|
|
Getting a Distinct Count Estimate (Method 2)
Suppose that every day, you want to estimate the number of unique family_
s for flights booked up to and including that date.APPROX_
to recalculate the count for all family_
values in the passenger_
table, you call the functions APPROX_
and APPROX_
.APPROX_
.
In general, you work with the low-level functions as follows: You call APPROX_
multiple times.APPROX_
to provide an estimate from the collection of input states.
The following code snippets illustrate how to use the low-level functions to generate incremental family_
estimates.
First, create a state table that the low-level functions will use.
CREATE TABLE distinct_family_day_states (flight_date DATE,state VARBINARY(16384));
Important
Use the data type VARBINARY(16384)
to store states that you use with the low-level APPROX_
functions.
At the end of the first day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)FROM passenger_bookingWHERE flight_date = TO_DATE('01/01/2018', 'MM/DD/YYYY')GROUP BY flight_date;
Then get the distinct family estimate for the first day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)FROM distinct_family_day_states;
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 10009 |
+----------------------------------------+
At the end of the second day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)FROM passenger_bookingWHERE flight_date = TO_DATE('01/02/2018', 'MM/DD/YYYY')GROUP BY flight_date;
Then get the distinct family estimate for the first and second day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)FROM distinct_family_day_states;
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 18227 |
+----------------------------------------+
To find the distinct family estimate for the third and fourth day, repeat the steps above.
Distinct family estimate for the third day |
|
Distinct family estimate for the fourth day |
|
Getting a Distinct Count Estimate (Method 3): Using APPROX_ COUNT_ DISTINCT_ COMBINE
In the previous section, you found a daily rolling estimate of the number of families who booked flights.
-
Call
APPROX_
to merge the daily states into a monthly state.COUNT_ DISTINCT_ COMBINE Insert the monthly state into a monthly states table. The code below shows how to create the monthly state for January 2018. Note that a WHERE
clause is not needed in theSELECT
statement; the daily states table contains data for one month only.
INSERT INTO distinct_family_month_statesSELECT 1, 2018, APPROX_COUNT_DISTINCT_COMBINE(state)FROM distinct_family_day_states;
-
Truncate the
distinct_
table.family_ day_ states Prior to truncating, the table should not contain daily states for any other month.
At the end of each day, insert the states from the distinct_
and distinct_
tables into a new temporary result set.APPROX_
on the states in the result set:
WITH all_states AS (SELECT state FROM distinct_family_day_statesUNION ALLSELECT state FROM distinct_family_month_states)SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state) FROM all_states;
In this section
Last modified: July 29, 2024