APPROX_PERCENTILE
Calculates the approximate percentile for a set of values. The calculation occurs for a given percentage.
APPROX_PERCENTILE
executes faster than the PERCENTILE_DISC and PERCENTILE_CONT and MEDIAN functions, and is an alternative to using those functions.
APPROX_PERCENTILE
is useful for workloads with large datasets that require statistical analysis.
May be used as a window function.
Syntax
Aggregate Function
APPROX_PERCENTILE (col | expr , percentage [, error_tolerance])
Window Function
APPROX_PERCENTILE (col | expr , percentage [, error_tolerance]) OVER ( [PARTITION BY (col | expr)] )
Arguments
col
: A column of a numeric data type.expr
: An expression that evaluates to a numeric data type.percentage
: A numeric value between0
and1
.error_tolerance
: A numeric value between0
and0.5
(not including0
).
Return Type
A DOUBLE
data type.
Remarks
While calculating the percentile,
APPROX_PERCENTILE
ignores all theNULL
values.A
percentage
value of0.5
will return the approximate median for the data set.The default value for
error_tolerance
is0.01
. A highererror_tolerance
value means a higher tolerance for error, hence lower accuracy.APPROX_PERCENTILE
can be used as a scalar aggregate function. It can also be used with aGROUP BY
clause.
Notice
If you require an exact value instead of an estimation, you should use the PERCENTILE_CONT
or PERCENTILE_DISC
functions.
Error Tolerance Behavior
The error threshold of APPROX_PERCENTILE
depends directly on the size of data and the requested quantile. We expect the approximate estimation to differ from the exact estimation within a bounded range of [-0.01,0.01] for more than 100k rows. For example, for a quantile of 0.5 (50th percentile), you can expect the estimation value will be within the range of 0.49 or 0.51 deterministic/original quantile (+-0.01). Additionally, accuracy improves with larger datasets, and also as the percentage approaches the lower or upper bounds of the distribution (i.e., 1st and 99th percentile).
Examples
The following examples use the courses
table:
CREATE TABLE courses(course_code VARCHAR(25), course_name VARCHAR(25), section_number INT, students_enrolled INT); INSERT INTO courses VALUES ('HIS-101', 'World History', '10', '65'), ('ALG-201', 'Algebra 2', '20', '35'), ('PSY-400', 'Abnormal Psychology', '20', '28'), ('SOC-200', 'Sociology 2', '20', '25'), ('SCI-101', 'Biology 1', '10', '72'), ('HIS-101', 'World History', '20', '47'), ('ALG-101', 'Algebra 1', '10', '42'), ('PSY-401', 'Abnormal Psychology', '10', '25'), ('SOC-201', 'Sociology 2', '20', '28'), ('SCI-102', 'Biology 2', '20', '91');
The contents of the courses
table are as follows:
Aggregate Function
The following example demonstrates the usage of APPROX_PERCENTILE
by calculating the number of students across different courses for a given percentile.
Example 1
SELECT course_code, APPROX_PERCENTILE(students_enrolled, 0.65) AS "percentile" FROM courses GROUP BY course_code; **** +-------------+------------+ | course_code | percentile | +-------------+------------+ | SCI-102 | 91.0000 | | PSY-400 | 28.0000 | | SCI-101 | 72.0000 | | SOC-201 | 28.0000 | | ALG-201 | 35.0000 | | ALG-101 | 42.0000 | | SOC-200 | 25.0000 | | HIS-101 | 61.4000 | | PSY-401 | 25.0000 | +-------------+------------+
Example 2
The following example calculates the median of the dataset.
SELECT course_code, APPROX_PERCENTILE(students_enrolled, 0.5) AS "approx_median" FROM courses GROUP BY course_code; **** +-------------+---------------+ | course_code | approx_median | +-------------+---------------+ | SCI-102 | 91.0000 | | PSY-400 | 28.0000 | | SCI-101 | 72.0000 | | SOC-201 | 28.0000 | | ALG-201 | 35.0000 | | ALG-101 | 42.0000 | | SOC-200 | 25.0000 | | HIS-101 | 56.0000 | | PSY-401 | 25.0000 | +-------------+---------------+
Window Function
SELECT DISTINCT(course_code), APPROX_PERCENTILE(students_enrolled, 0.3) OVER (ORDER BY course_code) AS "percentile" FROM courses; **** +-------------+------------+ | course_code | percentile | +-------------+------------+ | ALG-101 | 42.0000 | | ALG-201 | 35.7000 | | HIS-101 | 39.9000 | | PSY-400 | 35.0000 | | PSY-401 | 30.1000 | | SCI-101 | 32.2000 | | SCI-102 | 34.3000 | | SOC-200 | 29.4000 | | SOC-201 | 28.0000 | +-------------+------------+
Related Topics