5 Ways To Calculate Median In SQL
3 years back, when I thought I would get into the world of Data Science, things seemed quite overwhelming for me. So many things to learn in just one life! As I went on to explore the skills required to become a Data Scientist, something that gave me assurance was Statistics. I had loved Statistics since high school and through college. One of the many things that we learned early on in Statistics was to calculate Mean, Median and Mode of the data. In professional lingo, one would call them Measures of Central Tendency.
Quick Revision
Mean: The “average” number; found by adding all data points and dividing by the number of data points.
Median: The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).
Mode: The most frequent number — that is, the number that occurs the highest number of times.
I have a microblog on Measure of Central Tendency which further explains their need and use cases.
SQL: My One True Love
Needless to say, I am a flag bearer of “SQL Before Everything” movement!
SQL is the best tool when it comes to Data Analytics. One can perform every analysis there is using SQL. But can you calculate Measures of Central Tendency?
Yes, absolutely!
Calculating the Mean is very straight forward. One can use the AVG() aggregate function.
For calculating the Mode, one count the number of records in each category and obtain the group with highest count using the MAX() function.
But, calculating the Median can be little tricky as there are no direct functions to do so. Having said that, it is still absolutely possible to calculate Median in SQL. Not in one, but in 5 ways.
Let’s Calculate Median Using SQL
First, we should understand the mathematical formula used to calculate Median.
For a dataset having N elements, ordered from smallest to greatest,
Median = (N+1)/2th element, if N is odd
Median = (N/2th element + (N/2 + 1)th element)/2, if N is even
From the above formulae, we have keep in mind that the dataset must be ordered and we need to account for odd and even number of datapoints.
By Using A Variable And OFFSET — FETCH
DECLARE @c BIGINT = (SELECT COUNT(*) FROM sales.order_items)
SELECT AVG(list_price) AS "Median"
FROM (
SELECT list_price
FROM sales.order_items
ORDER BY list_price
OFFSET (@c - 1)/2 ROWS
FETCH NEXT 1 + (1 - @c%2) ROWS ONLY
) data
The variable @c
is assigned the total count of rows in the sales.order_items
table. The inner subquery orders the list prices in ascending order. The OFFSET (@c - 1)/2 ROWS
clause skips the first half of the rows (when @c
is odd) or exactly half the rows (when @c
is even). The FETCH NEXT 1 + (1 - @c%2) ROWS ONLY
clause fetches the next row when @c
is odd or no additional rows when @c
is even. Finally, the outer query calculates the average of the retrieved rows, which gives the median value.
By Using ROW_NUMBER() Window Function
SELECT AVG(list_price) AS "Median"
FROM
(
SELECT list_price,
ROW_NUMBER() OVER (ORDER BY list_price ASC, order_id ASC) AS RowAsc,
ROW_NUMBER() OVER (ORDER BY list_price DESC, order_id DESC) AS RowDesc
FROM sales.order_items
) data
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
The above query first assigns row numbers in ascending and descending order to each row based on the list_price
and order_id
. The subquery retrieves the list_price
and the corresponding row numbers for both ascending and descending order. The outer query calculates the average of the list_price
where the row number in ascending order is either equal to or within one position from the row number in descending order.
By Using ORDER BY, MIN() And Max() Functions
SELECT (
(SELECT MAX(list_price)
FROM (
SELECT TOP 50 PERCENT list_price
FROM sales.order_items
ORDER BY list_price) AS BottomHalf) +
(SELECT MIN(list_price)
FROM (
SELECT TOP 50 PERCENT list_price
FROM sales.order_items
ORDER BY list_price DESC) AS TopHalf)
) / 2 AS Median
Here, we divide the dataset into two halves (top and bottom) based on the list_price
. The inner subqueries find the top 50% and bottom 50% of the list prices separately. The MAX(list_price)
and MIN(list_price)
functions respectively retrieve the highest and lowest values from each half. The final result is obtained by adding the maximum value from the bottom half and the minimum value from the top half, then dividing the sum by 2 to get the median.
By Using NTILE() Window Function
SELECT MAX(list_price) AS "Median"
FROM (
SELECT list_price,
NTILE(4) OVER(ORDER BY list_price) AS Quartile
FROM sales.order_items
) X
WHERE Quartile = 2
The NTILE
function divides the result set into 4
equal parts (quartiles). The ORDER BY list_price
sorts the list prices in ascending order. The outer query selects the maximum list price from the subquery where Quartile
is 2
, representing the second quartile, which is the median.
By Using PERCENTILE_CONT() Window Function
SELECT DISTINCT PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY list_price) OVER() AS "Median"
FROM sales.order_items
The PERCENTILE_CONT
function calculates a specific percentile for a given column within a group. The 0.5
argument indicates that we want to find the 50th percentile, which is the median. The ORDER BY list_price
clause ensures that the list prices are sorted in ascending order before calculating the median. The DISTINCT
keyword makes sure that we only get one result for the overall median, even if there are multiple rows with the same median value. The OVER()
function is used without any partitioning, so the PERCENTILE_CONT
is applied to the entire result set.
NOTE: All queries are written in MS SQL Server.
Each of these queries offers a different approach to calculating the median in SQL, and their effectiveness may vary depending on the dataset size and database system being used. It’s always a good practice to test these queries on your specific dataset to find the most efficient and accurate method for calculating the median.
I am sure there are other ways to calculate Median in SQL. Please make sure to comment that if you come across any. Let me know which one is your favorite or preferred method from the above. I hope this blog helps you to enhance your SQL skills. And yes, “SQL Before Everything”!