# Distinct Count Estimation Functions

## Using Distinct Count Estimation Functions

Imagine that you work at an airline that is marketing air travel to families. As a data analyst, you would like to count the number of families who have booked flights. In your total, you count a family if two or more passengers belonging to that family have booked a flight on the airline. The passengers do not need to be traveling on the same flight. Also, if a family has booked more than one flight, you count that family only once.

You run your calculation on the `passenger_booking` table, which has the columns `flight_number`, `flight_date`, `passenger_id` and `family_id`. Passengers who belong to the same family share the same `family_id`. The table does not include bookings for passengers traveling alone.

To find the unique family total, you run the query `SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking`.

The code below creates the `passenger_booking` table and populates it with sample data.

```sql
DELIMITER //
CREATE PROCEDURE populate_booking_table(tbl VARCHAR(40), number_of_records INT,
  _flight_date DATE, _family_id INT) AS
BEGIN
  FOR i IN 1..number_of_records LOOP
    /* This example doesn't populate the booking table with a
       flight number or a passenger ID; these values aren't needed to
       demonstrate the distinct count estimate. */    
    INSERT INTO passenger_booking (flight_date, family_id)
      VALUES (_flight_date, _family_id);
  END LOOP;
END
//
DELIMITER ;

/* The following procedure creates the passenger_booking table and populates
   it with sample values for four days. The procedure intentionally populates
   family_id with duplicate values to demonstrate the distinct count estimate
   later. */

DELIMITER //
CREATE PROCEDURE create_sample_data() AS
BEGIN
  CREATE TABLE passenger_booking (flight_number INT, flight_date DATE,
    passenger_id INT, family_id INT);

  FOR family_id IN 1..10000 LOOP
    CALL populate_booking_table('passenger_booking', 1,
      TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 5001..10000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/01/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 8001..18000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/02/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 15001..35000 LOOP
    CALL populate_booking_table('passenger_booking', 3,
      TO_DATE('01/03/2018','MM/DD/YYYY'), family_id);
  END LOOP;
  FOR family_id IN 30001..50000 LOOP
    CALL populate_booking_table('passenger_booking', 2,
      TO_DATE('01/04/2018','MM/DD/YYYY'), family_id);
  END LOOP;
END
//
DELIMITER ;

CALL create_sample_data();

```

Find `family_id`’s distinct count estimate:

```sql
SELECT COUNT(DISTINCT family_id) AS result FROM passenger_booking;


```

```output

+--------+
| result |
+--------+
| 50000  |
+--------+

```

## Getting a Distinct Count Estimate (Method 1)

Calculating the distinct count of a large column can be time-consuming and memory-intensive. To reduce the cost of this calculation, you run an alternative function that gets an estimate of `family_id`’s distinct count:

```sql
SELECT APPROX_COUNT_DISTINCT(family_id) AS result FROM passenger_booking;


```

```output

+--------+
| result |
+--------+
| 49933  |
+--------+

```

`APPROX_COUNT_DISTINCT` is implemented using the [HyperLogLog](http://en.wikipedia.org/wiki/HyperLogLog) algorithm.

As shown in the following table, `family_id`’s estimate calculation runs faster and consumes less memory, as compared to the exact calculation.

| Query                                                            | Execution Time (After query compilation) | Memory Use (After query compilation) |
| ---------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ |
| `SELECT COUNT(DISTINCT family_id) FROM passenger_booking`        | `32`                                     | `775614`                             |
| `SELECT APPROX_COUNT_DISTINCT(family_id) FROM passenger_booking` | `9`                                      | `128453`                             |

## Getting a Distinct Count Estimate (Method 2)

Suppose that every day, you want to estimate the number of unique `family_id`s for flights booked up to and including that date. Rather than calling `APPROX_COUNT_DISTINCT` to recalculate the count for all `family_id` values in the `passenger_booking` table, you call the functions `APPROX_COUNT_DISTINCT_ACCUMULATE` and `APPROX_COUNT_DISTINCT_ESTIMATE`. These are low-level functions that allow you to obtain incremental estimates, consuming less time and memory than `APPROX_COUNT_DISTINCT`.

In general, you work with the low-level functions as follows: You call `APPROX_COUNT_DISTINCT_ACCUMULATE` multiple times. On each invocation, you provide a data subset to summarize into a state. You then call `APPROX_COUNT_DISTINCT_ESTIMATE` to provide an estimate from the collection of input states.

The following code snippets illustrate how to use the low-level functions to generate incremental `family_id` estimates.

First, create a state table that the low-level functions will use.

```sql
CREATE TABLE distinct_family_day_states (flight_date DATE,
  state VARBINARY(16384));

```

> **❗ Important**: Use the data type `VARBINARY(16384)` to store states that you use with the low-level `APPROX_COUNT_DISTINCT` functions.

At the end of the first day, create a state for that day and insert it into the table containing the states:

```sql
INSERT INTO distinct_family_day_states(flight_date, state)
  SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
  FROM passenger_booking
  WHERE flight_date = TO_DATE('01/01/2018', 'MM/DD/YYYY')
  GROUP BY flight_date;

```

Then get the distinct family estimate for the first day:

```sql
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
  FROM distinct_family_day_states;


```

```output

+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state)  |
+----------------------------------------+
|                                  10009 |
+----------------------------------------+  

```

At the end of the second day, create a state for that day and insert it into the table containing the states:

```sql
INSERT INTO distinct_family_day_states(flight_date, state)
  SELECT flight_date, APPROX_COUNT_DISTINCT_ACCUMULATE(family_id)
  FROM passenger_booking
  WHERE flight_date = TO_DATE('01/02/2018', 'MM/DD/YYYY')
  GROUP BY flight_date;

```

Then get the distinct family estimate for the first and second day:

```sql
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state)
  FROM distinct_family_day_states;


```

```output

+----------------------------------------+
| APPROX_COUNT_DISTINCT_ESTIMATE(state)  |
+----------------------------------------+
|                                  18227 |
+----------------------------------------+

```

To find the distinct family estimate for the third and fourth day, repeat the steps above. The results are shown in the table below.

| Distinct family estimate for the third day  | `34856` |
| ------------------------------------------- | ------- |
| Distinct family estimate for the fourth day | `49933` |

## Getting a Distinct Count Estimate (Method 3): Using `APPROX_COUNT_DISTINCT_COMBINE`

In the [previous section](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions.md), you found a daily rolling estimate of the number of families who booked flights. Imagine that you want to make this estimate every day for multiple years. You can optimize this calculation by taking the following steps, in order, at the end of every month:

* Call `APPROX_COUNT_DISTINCT_COMBINE` to merge the daily states into a monthly state. Insert the monthly state into a monthly states table. The code below shows how to create the monthly state for January 2018. Note that a `WHERE` clause is not needed in the `SELECT` statement; the daily states table contains data for one month only.

```sql
INSERT INTO distinct_family_month_states
  SELECT 1, 2018, APPROX_COUNT_DISTINCT_COMBINE(state)
  FROM distinct_family_day_states;

```

* Truncate the `distinct_family_day_states` table. Prior to truncating, the table should not contain daily states for any other month.

At the end of each day, insert the states from the `distinct_family_day_states` and `distinct_family_month_states` tables into a new temporary result set. Find the distinct family count estimate by running `APPROX_COUNT_DISTINCT_ESTIMATE` on the states in the result set:

```sql
WITH all_states AS (
  SELECT state FROM distinct_family_day_states
  UNION ALL
  SELECT state FROM distinct_family_month_states
)
SELECT APPROX_COUNT_DISTINCT_ESTIMATE(state) FROM all_states;

```

## In this section

* [APPROX\_COUNT\_DISTINCT](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/approx-count-distinct.md)
* [APPROX\_COUNT\_DISTINCT\_ACCUMULATE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/approx-count-distinct-accumulate.md)
* [APPROX\_COUNT\_DISTINCT\_COMBINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/approx-count-distinct-combine.md)
* [APPROX\_COUNT\_DISTINCT\_ESTIMATE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/approx-count-distinct-estimate.md)

***

Modified at: July 29, 2024

Source: [/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/](https://docs.singlestore.com/db/v9.1/reference/sql-reference/distinct-count-estimation-functions/)

(An index of the documentation is available at /llms.txt)
