SingleStore Managed Service

APPROX_PERCENTILE

Calculates the approximate percentile for a set of values. The calculation occurs for a given percentage.

APPROX_PERCENTILE executes faster than the PERCENTILE_DISC and PERCENTILE_CONT and MEDIAN functions, and is an alternative to using those functions.

APPROX_PERCENTILE is useful for workloads with large datasets that require statistical analysis.

Syntax
APPROX_PERCENTILE (col|expr , percentage [, error_tolerance])
Arguments
  • col: A column of a numeric data type.

  • expr: An expression that evaluates to a numeric data type.

  • percentage: A numeric value between 0 and 1.

  • error_tolerance: A numeric value between 0 and 0.5 (not including 0).

Return Type

A DOUBLE data type.

Remarks
  • While calculating the percentile, APPROX_PERCENTILE ignores all the NULL values.

  • A percentage value of 0.5 will return the approximate median for the data set.

  • The default value for error_tolerance is 0.01. A higher error_tolerance value means a higher tolerance for error, hence lower accuracy.

  • APPROX_PERCENTILE can be used as a scalar aggregate function. It can also be used with a GROUP BY clause.

Notice

If you require an exact value instead of an estimation, you should use the PERCENTILE_CONT or PERCENTILE_DISC functions.

Error Tolerance Behavior

The error threshold of APPROX_PERCENTILE depends directly on the size of data and the requested quantile. We expect the approximate estimation to differ from the exact estimation within a bounded range of [-0.01,0.01] for more than 100k rows. For example, for a quantile of 0.5 (50th percentile), you can expect the estimation value will be within the range of 0.49 or 0.51 deterministic/original quantile (+-0.01). Additionally, accuracy improves with larger datasets, and also as the percentage approaches the lower or upper bounds of the distribution (i.e., 1st and 99th percentile).

Examples

The following examples use the courses table:

CREATE TABLE courses(course_code TEXT, section_number INT, number_students INT);

The contents of the courses table are as follows:

+-------------+----------------+-----------------+
| course_code | section_number | number_students |
+-------------+----------------+-----------------+
| CS-101      |              1 |              20 |
| CS-101      |              2 |              16 |
| CS-101      |              3 |              22 |
| CS-101      |              4 |              25 |
| CS-101      |              5 |              22 |
| CS-150      |              1 |              10 |
| CS-150      |              2 |              16 |
| CS-150      |              3 |              11 |
| CS-150      |              4 |              17 |
| CS-150      |              5 |               9 |
| CS-201      |              1 |              14 |
| CS-201      |              2 |              17 |
| CS-301      |              1 |               7 |
| CS-301      |              2 |              10 |
+-------------+----------------+-----------------+
Example 1

The following example demonstrates the usage of APPROX_PERCENTILE by calculating the number of students across different courses for a given percentile.

Example 1a
SELECT course_code, APPROX_PERCENTILE(number_students, 0.65) AS "percentile"
FROM courses GROUP BY course_code;
****
+-------------+------------+
| course_code | percentile |
+-------------+------------+
| CS-301      |     9.4000 |
| CS-201      |    16.4000 |
| CS-101      |    22.0000 |
| CS-150      |    14.7500 |
+-------------+------------+
Example 1b

The following example calculates the median of the dataset.

SELECT course_code, APPROX_PERCENTILE(number_students, 0.5) AS "approx_median"
FROM courses GROUP BY course_code;
****
+-------------+---------------+
| course_code | approx_median |
+-------------+---------------+
| CS-301      |        8.5000 |
| CS-201      |       15.5000 |
| CS-101      |       22.0000 |
| CS-150      |       11.0000 |
+-------------+---------------+
Example 2

The APPROX_PERCENTILE function can be used as a Window Function.

Example 2a
SELECT DISTINCT(course_code), APPROX_PERCENTILE(number_students, 0.5) OVER (ORDER BY course_code) AS "percentile" FROM courses;
****
+-------------+------------+
| course_code | percentile |
+-------------+------------+
| CS-101      |    22.0000 |
| CS-150      |    16.5000 |
| CS-201      |    16.5000 |
| CS-301      |    16.0000 |
+-------------+------------+
Example 2b
SELECT DISTINCT(course_code), APPROX_PERCENTILE(number_students, 0.3, 0.3) OVER (ORDER BY course_code) AS "percentile" FROM courses;
****
+-------------+------------+
| course_code | percentile |
+-------------+------------+
| CS-101      |    20.0000 |
| CS-150      |    13.5000 |
| CS-201      |    14.2000 |
| CS-301      |    10.7000 |
+-------------+------------+

Related Topics