Box-Cox Transformation for Calculating Medians and Quartiles

Averages provide for rapid at-a-glance understanding and comparisons of datasets. Averages were easy to calculate even before computerized analysis, and thus entered not only the toolset of decision makers at every level, but the popular speech as well.

Greg Neyman

January 16, 2014

6 Min Read
ITPro Today logo in a gray background | ITPro Today

   Averages provide for rapid at-a-glance understanding and comparisons of datasets. Averages were easy to calculate even before computerized analysis, and thus entered not only the toolset of decision makers at every level, but the popular speech as well.

   However, averages only work on data that fits well under a bell-curve. A lot of data will naturally fit under a bell curve, but many commonly collected data points, such as household income, will not. These types of data are called Skewed, and trying to calculate an average on skewed data will frequently yield nonsensical results. For example, the average United States household income in the last census was $69,821. After a moment of analysis, most people would agree this just doesn’t seem right. But technically, that was the average. The problem is at many people around or below the poverty-line do not have large differences in their income, but the few wealthy can have millions of dollars in annual income. Mathematically, this draws the average upwards.

   In order to examine and compare skewed datasets, a Median can be calculated. A median is simply the middle of a dataset when it is placed in numerical order. This metric will discount the few astronomical values at the top of the dataset, and make more sense than an average when data is skewed, The median household income in the US is $50,502, which seems to make more sense.

   Unfortunately, while SQL has native ability to report averages and standard deviations, there is no ability like that to report medians and quartiles. Workarounds on the internet commonly recommend sorting the column you want a median from, and taking the middle of the results, but this is labor intensive. It also makes the report unable to use convenience (convenient?) features SQL users take for granted, like grouping and reporting multiple columns simultaneously. To get summaries like that, you would need to sort each column individually, limiting to groups one query at a time.

   Here is where a Box-Cox transformation comes in. A Box-Cox transformation is a deceptively simple algebraic formula designed to fit moderately skewed data under a bell curve. Below is the stored function code to transform a datapoint:

 

CREATE FUNCTION `BOXCOX`(y FLOAT, lambda FLOAT) RETURNS floatDETERMINISTICBEGINIF lambda=0THEN RETURN LN(y);END IF ;IF lambda<>0THEN RETURN (POW(y,lambda)-1)/lambda;END IF;END

 

   In this function, “y’” is the datapoint being transformed,  and lambda is the transformation coefficient. Choosing the correct lambda  is the key to making the transform work. It is usually a value between -2 and 2.

   The literature commonly recommends a method of determining the correct value of Lambda, which I will not get into, because, as a non statistician, and non mathematician, it is difficult for me to understand, let alone implement. But since my goal is to reduce skew, I commonly choose the Lambda that reduces skew the most. Below is php code that returns the skew of an array, its  standard error of skew (ses), and a determination of whether or not the supplied array is skewed.

 

function skew($array) {   $average=array_sum($array)/sizeof($array);   foreach($array as $v) {    $sum+=pow($v-$average,3);   }   $skew=abs($sum/((sizeof($array)-1)*pow(stdev($array),3)));   $ret['skew']=$skew;   $ses=sqrt(6/sizeof($array));   $ret['ses']=$ses;   if((2*$ses)>$skew) {    $ret['skewed']=false;   } else {    $ret['skewed']=true;   }   return $ret;}

 

    To see this method in operation, let’s look at some concrete data. Literally (http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength). I chose this dataset not because I know anything about concrete, but because it makes a good teaching case.

Taking a look at the column ‘Age’, you will see a demonstration of skewed data. The average of that column is 45.7, but it’s median is 28. Something of a big difference. Quartiles 1 and 3 are 7 and 56, respectively. The skew is 3.27, which is is more than 40 standard errors of skew (data that is more than 2 SES’s are considered skewed).

   Using averages and standard deviations on this data would make for misleading results. So after iteratively searching for a lambda of the data that would minimize skew, I picked 0.1, which would leave the data skewed 0.15 (or just under 2 SES’s). I make a new column after the untransformed data called `age0.1`, and transform each data point, and put it into the new column:

 

UPDATE `concrete` SET `age0.1` = BOXCOX ( `age `, 0.1 ); 

 

   Now that we have a new column with transformed data, the fun can really begin. The average of that data is 3.82, which doesn’t seem immediately helpful. But apply the following stored function:

 

CREATE FUNCTION `UNBOXCOX`(y FLOAT, lambda FLOAT) RETURNS floatDETERMINISTICBEGINIF lambda=0 THEN RETURN EXP(y); END IF ; IF lambda<>0THEN RETURN POW((y*lambda)+1,1/lambda);END IF;END

 

Like so:

 

SELECT UNBOXCOX ( AVG ( `age0.1` ) , 0.1 ) FROM `concrete`;

 

And you get 25.43, which is a stone’s throw from 28. Some more thorny SQL gets you quartiles:

 

UNBOXCOX( AVG(  `age0.1` ) - ( 0.68 * STD(  `age0.1` ) ) , 0.1 ) AS  `q1` , UNBOXCOX( AVG(  `age0.1` ) + ( 0.68 * STD(  `age0.1` ) ) , 0.1 ) AS  `q3`

 

of 10.99 and 55.11, which are close to 7 and 56.

 

   But let’s say you knew something about concrete, and somehow ash was important to age of the concrete. To tackle this median the old way, you would need to:

 

SELECT `age` FROM `concrete` WHERE `ash`=0 ORDER BY `age` ASC LIMIT 283, 1

 

And then you would need to repeat for all the quartiles, and when `ash`>0, and don’t forget to use the correct LIMIT based on whether `ash` is zero or greater than. However, the following query:

 

SELECT CASE WHEN  `ash` =0THEN  'No Ash'ELSE  'Yes Ash'END AS  `Ash` , UNBOXCOX( AVG(  `age0.1` ) , 0.1 ) AS  `median` , UNBOXCOX( AVG(  `age0.1` ) - ( 0.68 * STD(  `age0.1` ) ) , 0.1 ) AS  `q1` , UNBOXCOX( AVG(  `age0.1` ) + ( 0.68 * STD(  `age0.1` ) ) , 0.1 ) AS  `q3` FROM  `concrete` GROUP BY CASE WHEN  `ash` =0THEN  'No Ash'ELSE  'Yes Ash'END 

 

gets you results that you can now visually inspect and look for important differences in median ages.

   Box-Cox transformations will not typically get you true medians, but instead a result that is good enough for quick scanning of a results table. Then you can manually drill down further if important differences are noted. Box-Cox transformations will not always resolve a skew. For instance, since `ash` has more than half it’s entries as zero, a Box-Cox transformation will not assist you in finding the median. The key is how skewed the transformed data is.

   Whenever data can have a few astronomic outliers, using SQL averages may not be the correct analysis to compare groups and columns. Consider incorporating Box-Cox transformations to assist analyzing such data.

 

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like