Calculate Percentiles
Use T-SQL to improve your statistical analysis
October 29, 2012
Percentiles and percentile ranks are values used in statistical analysis of data. Given a set of values representing certain observations or scores (e.g., students’ marks in an exam), a percentile is a value below which a certain percent of the scores fall. For example, for the following exam scores: 40, 55, 70, 70, 70, 75, 75, 80, 80, 95, and 100, the 50th percentile (also known as the median) is the mark below which 50 percent of the scores fall (in this case, 75). The 25th percentile (also known as the first quartile) is 70, and so on.
Related: Inverse Distribution Functions
A percentile rank is the percent of values in a certain set of observations or scores that are lower than a certain value. For example, using the same scores, the percentile rank of the score 75 is 50 percent because 50 percent of the scores are lower than 75. The percentile rank of the score 80 is 70 percent, and so on.
After I explain how to calculate percentile ranks using T-SQL, I’ll show you how to calculate percentiles. In my examples I’ll use a table called Marks. To follow my demonstration, you’ll need to run the code in Listing 1 to create the Marks table in the tempdb database and populate it with sample data.
Note that no standard algorithm exists for calculating percentiles. I use algorithms similar to the ones used by Microsoft Excel. If you want to implement other algorithms, you’ll probably need to adjust my techniques.
Related: T-SQL Deep Dives: Create Efficient Queries
Percentile Ranks
As I already mentioned, a percentile rank is the percent of values from a certain data set (call it S) that are smaller than a certain value (call it v). The PERCENTRANK(S, v) function in Excel uses different calculations based on whether the given value appears in the given data set. If it does, the function uses the following formula: (rnk-1)/ (cnt-1), where rnk represents the rank of the value (number of values in S that are smaller than v, plus 1) and cnt represents the count of values in S.
The implementation of the percentile rank calculation for a value that appears in the set is simple; all you need is the rank of the value and the count of the values in the set. The rank can be calculated in T-SQL using the RANK function, and the count can be calculated using the COUNT function. The code in Listing 2 calculates the percentile rank of each mark from the Marks table, generating the output shown in Table 1.
The query that defines the common table expression (CTE) called MarksRnkCnt calculates, for each mark, the rank of the mark (column rnk) using the RANK function and the COUNT of rows in the table (column cnt) using the COUNT function with the OVER clause. The outer query implements the formula for percentile rank using the expression 1.*(rnk-1)/(cnt-1). The use of 1. (one dot) here is to cause implicit conversion of the operands of the expression to numeric values, otherwise T-SQL would use integer division between the integer operands.
If the given value doesn’t appear in the set, the PERCENTRANK function in Excel interpolates to return the result. In such a case, the formula the function uses is:
pctrnk_smaller_val + (v - smaller_val) / (larger_val - smaller_val) * (pctrnk_larger_val - pctrnk_smaller_val)
where v represents the given value, smaller_val and larger_val represent the two closest values from the data set that are smaller and larger than v, and pctrnk_larger_val and pctrnk_smaller_val represent the percentile ranks of those two values. As an example, 85 is a mark that doesn’t appear in the Marks table. The two closest marks smaller and larger than 85 are 80 and 95, and their percentile ranks are 0.7 and 0.9 (see Table 1). The percentile rank of the mark 85 is therefore calculated as:
0.7 + (85 - 80) / (95 - 80) * (0.9 - 0.7) = 0.766667
This means that with interpolation, 76.6667 percent of the marks are lower than the mark 85.
To add the interpolation logic to the calculation, you first need to pair adjacent marks from the Marks table, and calculate their percentile ranks. This is achieved with the code in Listing 3, generating the output shown in Table 2.
The code that defines the CTE called Marks RnkCnt calculates a rank, a dense rank, and a count for each mark from the Marks table. The code that defines the CTE called PctRanks queries MarksRnkCnt, calculates the percentile ranks as I previously explained, and removes duplicate rows. The dense rank becomes a row number after the removal of duplicates. The outer query joins two instances of PctRanks aliased as Cur and Nxt to pair adjacent marks and their respective percentile ranks.
The code in Listing 4 shows how to add logic that for an input value (stored in the variable @mark) calculates the percentile rank of the input. The code defines a CTE called PctRank Ranges based on the last query from Listing 3. The outer query in Listing 4 identifies the relevant pair of marks and percentile ranks from PctRankRanges. This is achieved with the predicate:
(@mark > mark_from AND @mark <= mark_to) OR (rownum = 1 AND @mark = mark_from)
A CASE expression in the SELECT list determines whether interpolation is required. If the input mark (@mark) is equal to one of the existing marks in the range, the CASE expression returns the corresponding percentile rank; otherwise, the CASE expression uses the aforementioned formula to apply interpolation. The code in Listing 4 returns the value 0.766667 (76.6667 percent).
With minor revisions to the code in Listing 4 you can calculate the percentile ranks of multiple values stored in a table. Listing 5 shows how to do so. The code populates a table variable called @MyMarks with a few random marks. The outer query in Listing 5 joins the table variable with PctRankRanges, to apply the calculation to each mark from the table variable as opposed to a single mark in a scalar variable. When I ran the code in Listing 5 on my system I got the output shown in Table 3. Note that the calculation returns NULLs for marks that are smaller than the minimum or higher than the maximum in the Marks table.
Percentiles
As I explained, a percentile is a value below which a certain percent of values fall. I’ll describe the way Excel calculates the PERCENTILE function, then I’ll explain how to implement similar logic with T-SQL.
Given an ordered set of values v1, v2, …, vcnt, and a request for percentile pct, Excel’s PERCENTILE function calculates the pctth percentile as follows:
Calculate n as pct * (cnt - 1) + 1
Using the marks from our Marks table as an example, the ordered set of marks is: v1 = 40, v2 = 55, v3 = 70, v4 = 70, v5 = 70, v6 = 75, v7 = 75, v8 = 80, v9 = 80, v10 = 95, v11 = 100. In our case cnt = 11. For pct = 0.95 (95th percentile) n = 0.95 * (11 - 1) + 1 = 10.5.
Now, let k be the integer part of n and d be the decimal part of n. In our case, k = 10 and d = 0.5. If d = 0, return vk. That is, when n is whole (n = k), return the kth value in the ordered set. If d <> 0, interpolate to produce the percentile as follows:
When k = cnt, return vk-1 + d * (vk - vk-1)
When k < cnt, return vk + d * (vk+1 - vk)
In our case, k < cnt (10 < 11), therefore the calculation is vk + d * (vk+1 - vk): 95 + 0.5 * (100 - 95) = 97.5.
The code in Listing 6, implements this logic and calculates the percentiles for a whole set of percents represented by the CTE called Pcts. Play with your own percents to see how the output changes. The CTE called CntMarks has a single row with the value cnt representing the count of rows in the Marks table. The query defining the CTE called PctCnt_n calculates for each percent from Pcts the value n described earlier. The query in the CTE called PctCnt_ndk breaks n to its integer and decimal components k and d, respectively. The query defining the CTE called MarksRn simply calculates row numbers (column rn) for the Marks from the Marks table. Think of the row numbers as the positions in the ordered set (1, 2, …, cnt), and the corresponding marks as v1, v2, …, vcnt. The outer query joins PctCnt_ndk and MarksRn, matching to each percent from PctCnt_ndk the relevant marks from MarksRn. To remind you, when d = 0, the relevant mark is vk; else, when k = cnt, the two relevant marks are vk-1 and vk; else, when k < cnt, the relevant marks are vk and vk+1. In all three cases, for each percent, the formulas for calculating the percentiles can be generalized to:
MIN(mark) + d * (MAX(mark) - MIN(mark))
The output of the code in Listing 6 is shown in Table 4. As you can see in the output, the median is the mark 75, the lower quartile is the mark 70, and so on.
Note that Excel interpolates to calculate the percentile when n is not whole. Other ways to calculate percentiles are to round n and return the value in the rounded position in the ordered set, but I’ll leave this method for you as an exercise. [Editor’s Note: If you try this method, please email us with your experiences. Send your feedback to [email protected].]
Statistically Speaking
You can easily use T-SQL to implement statistical calculations such as percentile and percentile rank. In addition, you’ll likely encounter many other Excel-supported statistical calculations and other types of calculations that are handy for use in a database. Besides the usefulness of such functions in statistical analysis, trying to implement them with T-SQL is a great exercise.
About the Author
You May Also Like