# Bulk Calculation of Quartile and Median Values in SQL Server

August 15, 2011 11 Comments

Previously, I posted about calculating median and quartiles in Analysis Services. In this post, I’ll do the same calculations over multiple sets of rows in a SQL Server database.

SQL Server offers an NTILE() function, but this is not equivalent to the Quartile() function in Excel which is the functionality we are trying to replicate. NTILE() is a window function that essentially returns a value representing the group into which each record would fall if you are trying to rank records. What we want is an aggregation function that will return a single value (25th percentile, 50th percentile or median, or 75th percentile) for each group of records.

The code below is an example of how this can be done in a single SQL statement.

WITH SourceData AS ( --Table of sets over which we want to produce multiple quartiles. SELECT sales.SalesAmount Measure, sales.ProductKey AS GroupID FROM FactInternetSales AS sales ) --Aggregate into a single record for each group, using MAX to select the non-null --detail value for each column. SELECT GroupID, (Max(Q1NextVal) - MAX(Q1Val)) * Max(Q1Frac) + Max(Q1Val) Quartile1, (Max(MidVal1) + Max(MidVal2)) / 2 Median, (Max(Q3NextVal) - MAX(Q3Val)) * Max(Q3Frac) + Max(Q3Val) Quartile3 --Save into a separate table if desired. --INTO dbo.PreCalculatedQuartiles FROM ( --Expose the detail values for only the records at the index values --generated by the summary subquery. All other values are left as NULL. SELECT detail.GroupID, CASE WHEN RowNum = Q1Idx THEN Measure ELSE NULL END Q1Val, CASE WHEN RowNum = Q1Idx + 1 THEN Measure ELSE NULL END Q1NextVal, CASE WHEN RowNum = Q3Idx THEN Measure ELSE NULL END Q3Val, CASE WHEN RowNum = Q3Idx + 1 THEN Measure ELSE NULL END Q3NextVal, Q1Frac, Q3Frac, CASE WHEN RowNum = MidPt1 THEN Measure ELSE NULL END MidVal1, CASE WHEN RowNum = MidPt2 THEN Measure ELSE NULL END MidVal2 FROM --Calculate a row number sorted by measure for each group. (SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY Measure) RowNum FROM SourceData) AS detail INNER JOIN ( --Summarize to find index numbers and fractions we need to use to locate --the values at the quartile points. SELECT GroupID, FLOOR((COUNT(*) + 3) / 4.0) Q1Idx, ((COUNT(*) + 3) / 4.0) - FLOOR((COUNT(*) + 3) / 4.0) Q1Frac, (COUNT(*) + 1) / 2 AS MidPt1, (COUNT(*) + 2) / 2 AS Midpt2, FLOOR((COUNT(*) * 3 + 1) / 4.0) Q3Idx, ((COUNT(*) * 3 + 1) / 4.0) - FLOOR((COUNT(*) * 3 + 1) / 4.0) Q3Frac FROM SourceData GROUP BY GroupID HAVING COUNT(*) > 1 ) AS summary ON detail.GroupID = summary.GroupID ) AS combined GROUP BY GroupID;

First, the SourceData CTE specifies our source data over which we want to calculate one set of quartiles for each value of our grouping column. We are aliasing the input column names to GroupID and Measure to allow us to make later modifications to make this more generalized. In the summary subquery, we are determining, from the record count for each group, the index numbers we need to locate the records that will be used for our quartiles along with fractional values that will also be required. The detail subquery orders records by measure, assigns an index number to each row and then displays data values at rows that have been identified in the summary subquery returning NULL values for all other rows. Finally, the combined subquery reduces the rows to one record per group and calculates the final quartile values.

It’s possible to generalize this query to a SQL Server table-valued function. You would simply need to create a user-defined table type with Group ID and Measure columns, use an instance of this as the input parameter to the function, and then use the above SQL as the body of the UDF, replacing the CTE with the input parameter.

Thanks. This script really helped me a lot

Here is the other solution…

I have created SP for the same.

Calculation is based on given below link…

http://www.math.montana.edu/~nmp/materials/ess/hydrosphere/expert/rrf/statterm.html

EDIT:Code moved to PasteBin: http://pastebin.com/3A69pkwQ

Anal,

I moved the code you supplied to PasteBin as it was a bit long for a comment.

I tested your code on my dev box and unsurprisingly since there are a few different ways of calculating a quartile, it returns different values than mine.

Mine uses the same algorithm as the original QUARTILE() function in Excel. In Excel 2010, there are now two different QUARTILE() functions, QUARTILE.INC() and QUARTILE.EXC().

You can look at the online help for these, but it looks like QUARTILE() and QUARTILE.INC() are the same thing.

Your algorithm looks to be similar to Method 1 on http://en.wikipedia.org/wiki/Quartile which gives different results from Excel’s QUARTILE.EXC() on Example 2.

Here’s a variation of my technique but using (I think) the Wikipedia Method 1 algorithm and Anal’s data.

http://pastebin.com/AAMGFKrq

I’m not 100% sure this would be correct, but it seemed to give the same answers across the different sets of data. I reckon the part where it identifies the indexes of the quartiles could be refactored a bit.

how can we calculate if the count is even for the lower and upper quartile calculation

Thank you! I wish I had found this 2 days ago!

Pingback: Confluence: Data Warehouse

Pingback: Confluence: Data Warehouse

Pingback: Confluence: EIS Team Wiki

Pingback: Confluence: Data Warehouse

Pingback: Confluence: Data Warehouse