## Calculating percentiles in MYSQL

Posted by - raja 2010/10/10 03:53:31

Computing 90th percentile in MYSQL.. I was doing some interesting analysis on percentiles. In the process, I had to put some results in MYSQL and use it for calculating them. I am somewhat fanatical about the usage of select statements without recourse to stored procedures for doing anything in the database. So I was hunting for a "pure select" way of accomplishing this.  It turns out that MYSQL does support stored procedure variables without mandating the use of the "create procedure" directive. This, to me, represented the best of both worlds. I wanted to avoid the admittedly dubious pain (that springs out of my personal prejudices about writing an SP) of writing a stored procedure and instead accomplish the objective with a plain vanilla select.  So off I went in that direction and this blog post is the result.

Let us say we have a table called rank1 that stores two columns - an Id and a value. The ID is non-unique in that an ID can have multiple values. I want to compute the average and the 90th percentile of the values for each ID. Example: Consider the following data:

ID Value
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
2 5
2 10
2 15
2 20
2 25
2 30
2 35
2 40
2 45
2 50

What I want is the following:

ID 90th percentile Average Count
1 9 5.5 10
2 45 27.5 10

Here is a sql that accomplishes this:

``````select
id,
value as percentile,
count_col,average from (
SELECT @row := if(@prev_id = rank2.id, @row + 1, 1) AS row,
rank2.ID as id ,
rank2.VALUE as value ,
COUNT_TABLE.COUNT_COL as count_col,
count_table.percentile_rownum as percentile_rownum,
count_table.average as average,
@prev_id := rank2.id as dummy
FROM RANK1 AS RANK2,
(
SELECT COUNT(*) as COUNT_COL,
rank1.id,
round(0.90*count(*)) as percentile_rownum,
avg(rank1.value) as average
FROM RANK1 group by RANK1.ID) AS COUNT_TABLE
WHERE
count_table.id = rank2.id
ORDER BY rank2.id, VALUE
) as
allrows where
row = percentile_rownum``````

The gist is as follows. We first compute the average, count and the number at which the percentile figure occurs in the inner most select. We use this information to join back all the Ids and values along with the grouped information in the intermediary select. This intermediate SQL also computes the row number IDs and groups them.
The if clause in the @row computation works to group the row numbers by ID. The @id is required to achieve this since it is a variable that monitors the change in ID from 1 to 2.

The percentile is computed by finding the row number that corresponds to the number that was computed in the inner most select. Shoot me a question if you are having trouble following the logic.

##### Tags
MYSQL  Performance Analysis  