Using Distinct Count Estimation Functions
On this page
Imagine that you work at an airline that is marketing air travel to families.
You run your calculation on the passenger_
table, which has the columns flight_
, flight_
, passenger_
and family_
.family_
.
To find the unique family total, you run the query SELECT COUNT(DISTINCT family_
.
The code below creates the passenger_
table and populates it with sample data.
DELIMITER //CREATE PROCEDURE populate_booking_table(tbl VARCHAR(40), number_of_records INT,_flight_date DATE, _family_id INT) ASBEGINFOR i IN 1..number_of_records LOOP/* This example doesn't populate the booking table with aflight number or a passenger ID; these values aren't needed todemonstrate the distinct count estimate. */INSERT INTO passenger_booking (flight_date, family_id)VALUES (_flight_date, _family_id);END LOOP;END//DELIMITER ;/* The following procedure creates the passenger_booking table and populatesit with sample values for four days. The procedure intentionally populatesfamily_id with duplicate values to demonstrate the distinct count estimatelater. */DELIMITER //CREATE PROCEDURE create_sample_data() ASBEGINCREATE TABLE passenger_booking (flight_number INT, flight_date DATE,passenger_id INT, family_id INT);FOR family_id IN 1..10000 LOOPCALL populate_booking_table('passenger_booking', 1,TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 5001..10000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 8001..18000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/02/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 15001..35000 LOOPCALL populate_booking_table('passenger_booking', 3,TO_DATE('01/03/2018','MM/DD/YYYY'), family_id);END LOOP;FOR family_id IN 30001..50000 LOOPCALL populate_booking_table('passenger_booking', 2,TO_DATE('01/04/2018','MM/DD/YYYY'), family_id);END LOOP;END//DELIMITER ;CALL create_sample_data();
Find family_
’s distinct count estimate:
SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking;
+--------+
| result |
+--------+
| 50000 |
+--------+
Getting a Distinct Count Estimate (Method 1)
Calculating the distinct count of a large column can be time-consuming and memory-intensive.family_
’s distinct count:
SELECT APPROX_COUNT_DISTINCT(family_id) AS result FROM passenger_booking;
+--------+
| result |
+--------+
| 49933 |
+--------+
APPROX_
is implemented using the HyperLogLog algorithm.
As shown in the following table, family_
’s estimate calculation runs faster and consumes less memory, as compared to the exact calculation.
Query |
Execution Time (After query compilation) |
Memory Use (After query compilation) |
---|---|---|
|
|
|
|
|
|
Getting a Distinct Count Estimate (Method 2)
Suppose that every day, you want to estimate the number of unique family_
s for flights booked up to and including that date.APPROX_
to recalculate the count for all family_
values in the passenger_
table, you call the functions APPROX_
and APPROX_
.APPROX_
.
In general, you work with the low-level functions as follows: You call APPROX_
multiple times.APPROX_
to provide an estimate from the collection of input states.
The following code snippets illustrate how to use the low-level functions to generate incremental family_
estimates.
First, create a state table that the low-level functions will use.
CREATE TABLE distinct_family_day_states (flight_date DATE,state VARBINARY(16384));
Important
Use the data type VARBINARY(16384)
to store states that you use with the low-level APPROX_
functions.
At the end of the first day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)FROM passenger_bookingWHERE flight_date = TO_DATE('01/01/2018', 'MM/DD/YYYY')GROUP BY flight_date;
Then get the distinct family estimate for the first day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)FROM distinct_family_day_states;
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 10009 |
+----------------------------------------+
At the end of the second day, create a state for that day and insert it into the table containing the states:
INSERT INTO distinct_family_day_states(flight_date, state)SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)FROM passenger_bookingWHERE flight_date = TO_DATE('01/02/2018', 'MM/DD/YYYY')GROUP BY flight_date;
Then get the distinct family estimate for the first and second day:
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)FROM distinct_family_day_states;
+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state) |
+----------------------------------------+
| 18227 |
+----------------------------------------+
To find the distinct family estimate for the third and fourth day, repeat the steps above.
Distinct family estimate for the third day |
|
Distinct family estimate for the fourth day |
|
Getting a Distinct Count Estimate (Method 3): Using APPROX_ COUNT_ DISTINCT_ COMBINE
In the previous section, you found a daily rolling estimate of the number of families who booked flights.
-
Call
APPROX_
to merge the daily states into a monthly state.COUNT_ DISTINCT_ COMBINE Insert the monthly state into a monthly states table. The code below shows how to create the monthly state for January 2018. Note that a WHERE
clause is not needed in theSELECT
statement; the daily states table contains data for one month only.
INSERT INTO distinct_family_month_statesSELECT 1, 2018, APPROX_COUNT_DISTINCT_COMBINE(state)FROM distinct_family_day_states;
-
Truncate the
distinct_
table.family_ day_ states Prior to truncating, the table should not contain daily states for any other month.
At the end of each day, insert the states from the distinct_
and distinct_
tables into a new temporary result set.APPROX_
on the states in the result set:
WITH all_states AS (SELECT state FROM distinct_family_day_statesUNION ALLSELECT state FROM distinct_family_month_states)SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state) FROM all_states;
Last modified: April 22, 2024