Bulk Calculation of Quartile and Median Values in SQL Server

Previously, I posted about calculating median and quartiles in Analysis Services. In this post, I’ll do the same calculations over multiple sets of rows in a SQL Server database.

SQL Server offers an NTILE() function, but this is not equivalent to the Quartile() function in Excel which is the functionality we are trying to replicate. NTILE() is a window function that essentially returns a value representing the group into which each record would fall if you are trying to rank records. What we want is an aggregation function that will return a single value (25th percentile, 50th percentile or median, or 75th percentile) for each group of records.

The code below is an example of how this can be done in a single SQL statement.

WITH SourceData AS (

    --Table of sets over which we want to produce multiple quartiles.
    SELECT sales.SalesAmount Measure,
        sales.ProductKey AS GroupID
    FROM FactInternetSales AS sales

)
--Aggregate into a single record for each group, using MAX to select the non-null 
--detail value for each column.
SELECT GroupID, 
    (Max(Q1NextVal) - MAX(Q1Val)) * Max(Q1Frac) + Max(Q1Val) Quartile1,
    (Max(MidVal1) + Max(MidVal2)) / 2 Median,
    (Max(Q3NextVal) - MAX(Q3Val)) * Max(Q3Frac) + Max(Q3Val) Quartile3

--Save into a separate table if desired.
--INTO dbo.PreCalculatedQuartiles
FROM (
    --Expose the detail values for only the records at the index values 
    --generated by the summary subquery. All other values are left as NULL.
    SELECT detail.GroupID, 
        CASE WHEN RowNum = Q1Idx THEN Measure ELSE NULL END Q1Val,
        CASE WHEN RowNum = Q1Idx + 1 THEN Measure ELSE NULL END Q1NextVal,
        CASE WHEN RowNum = Q3Idx THEN Measure ELSE NULL END Q3Val,
        CASE WHEN RowNum = Q3Idx + 1 THEN Measure ELSE NULL END Q3NextVal,
        Q1Frac,
        Q3Frac,
        CASE WHEN RowNum = MidPt1 THEN Measure ELSE NULL END MidVal1,
        CASE WHEN RowNum = MidPt2 THEN Measure ELSE NULL END MidVal2
    FROM
        --Calculate a row number sorted by measure for each group.
        (SELECT *,  ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY Measure) RowNum
        FROM SourceData) AS detail

    INNER JOIN (
        --Summarize to find index numbers and fractions we need to use to locate 
        --the values at the quartile points.
        SELECT GroupID, 
            FLOOR((COUNT(*) + 3) / 4.0) Q1Idx,
            ((COUNT(*) + 3) / 4.0) - FLOOR((COUNT(*) + 3) / 4.0) Q1Frac,
            (COUNT(*) + 1) / 2 AS MidPt1,
            (COUNT(*) + 2) / 2 AS Midpt2,
            FLOOR((COUNT(*) * 3 + 1) / 4.0) Q3Idx,
            ((COUNT(*) * 3 + 1) / 4.0) - FLOOR((COUNT(*) * 3 + 1) / 4.0) Q3Frac
        FROM SourceData
        GROUP BY GroupID
        HAVING COUNT(*) > 1
    
    ) AS summary ON detail.GroupID  = summary.GroupID

) AS combined

GROUP BY GroupID;

First, the SourceData CTE specifies our source data over which we want to calculate one set of quartiles for each value of our grouping column. We are aliasing the input column names to GroupID and Measure to allow us to make later modifications to make this more generalized. In the summary subquery, we are determining, from the record count for each group, the index numbers we need to locate the records that will be used for our quartiles along with fractional values that will also be required. The detail subquery orders records by measure, assigns an index number to each row and then displays data values at rows that have been identified in the summary subquery returning NULL values for all other rows. Finally, the combined subquery reduces the rows to one record per group and calculates the final quartile values.

It’s possible to generalize this query to a SQL Server table-valued function. You would simply need to create a user-defined table type with Group ID and Measure columns, use an instance of this as the input parameter to the function, and then use the above SQL as the body of the UDF, replacing the CTE with the input parameter.

Advertisements

11 Responses to Bulk Calculation of Quartile and Median Values in SQL Server

  1. Pradeep Nair says:

    Thanks. This script really helped me a lot

    • Anal Patel says:

      Here is the other solution…
      I have created SP for the same.

      Calculation is based on given below link…

      http://www.math.montana.edu/~nmp/materials/ess/hydrosphere/expert/rrf/statterm.html

      EDIT:
      Code moved to PasteBin: http://pastebin.com/3A69pkwQ

      • Mike H says:

        Anal,
        I moved the code you supplied to PasteBin as it was a bit long for a comment.
        I tested your code on my dev box and unsurprisingly since there are a few different ways of calculating a quartile, it returns different values than mine.
        Mine uses the same algorithm as the original QUARTILE() function in Excel. In Excel 2010, there are now two different QUARTILE() functions, QUARTILE.INC() and QUARTILE.EXC().
        You can look at the online help for these, but it looks like QUARTILE() and QUARTILE.INC() are the same thing.
        Your algorithm looks to be similar to Method 1 on http://en.wikipedia.org/wiki/Quartile which gives different results from Excel’s QUARTILE.EXC() on Example 2.

      • Mike H says:

        Here’s a variation of my technique but using (I think) the Wikipedia Method 1 algorithm and Anal’s data.
        http://pastebin.com/AAMGFKrq
        I’m not 100% sure this would be correct, but it seemed to give the same answers across the different sets of data. I reckon the part where it identifies the indexes of the quartiles could be refactored a bit.

    • chotu says:

      how can we calculate if the count is even for the lower and upper quartile calculation

  2. Ryan James says:

    Thank you! I wish I had found this 2 days ago!

  3. Pingback: Confluence: Data Warehouse

  4. Pingback: Confluence: Data Warehouse

  5. Pingback: Confluence: EIS Team Wiki

  6. Pingback: Confluence: Data Warehouse

  7. Pingback: Confluence: Data Warehouse

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: