Preface
In general, data transformations seemed pretty arbitrary to me. I would frequently log or square-root data to make the data follow a distribution. A stickler for procedures, I often wondered, “is there a procedure for finding the best transformation?” Well, luckily there is a procedure, and I have to thank two statisticians with similar sounding names who collaborated with each other - partially because they have similar sounding names.
What is it?
The Box-Cox procedure performs a data transformation or a regression transformation.
Why am I using it?
Here, I am only interested in talking about the regression transform, not the data transform.
What does it look like?
There are several ways to represent the transform[^1], based on the original distribution of the data, but in generalities, it looks like:
$$Y(\lambda) = \Bigg\{\begin{array} 1\frac{Y^{\lambda}-1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(Y) & \text{if } \lambda = 0 \end{array}$$
Here, $Y$
is a random variable, the response variable. Yes, it’s one-sided, and a transform is not performed on the explanatory variables. The assumption is that $Y^k$
behaves with a normal distribution. Therefore, the multivariate density of $Y^k$
is expressed:
$$ f_{Y^{\lambda}} = \frac{\exp(-\frac{1}{2}(y^{\lambda} - \mu_y)^T(\Sigma^{-1}) (y^{\lambda} - \mu_y))}{\sqrt{(2\pi)^k | \Sigma |}} $$
The assumption is constant variance, so $\Sigma = \sigma^2 I$
. Note $k$
is equal to the index. To get the density of $Y$
, the method of transformations (no relation to Box-Cox transformation) is used where the Jacobian is multiplied by the density function of $Y^{\lambda}$
$$ f_Y = \frac{\exp(-\frac{1}{2}(y^{\lambda} - \mu_y)^T(\Sigma^{-1}) (y^{\lambda} - \mu_y))}{\sqrt{(2\pi)^k | \Sigma |}} \prod_i^k y^{\lambda-1} $$
Taking the log,
$$ \log(f_Y) = \log\Bigg(\frac{\big(\exp(-\frac{1}{2}(y^{\lambda} - \mu_y)^T(\Sigma^{-1}) (y^{\lambda} - \mu_y))}{\sqrt{(2\pi)^k | \Sigma |}} \prod_i^k y^{\lambda-1} \Bigg) $$
$$ \log(f_Y) = -\frac{1}{2}(y^{\lambda} - \mu_y)^T(\Sigma^{-1}) (y^{\lambda} - \mu_y) - \frac{k}{2} \log(2\pi \Sigma) + (\lambda-1)\sum_i^k y_i $$
The estimates of $\mu_y, \sigma^2$
, and $\lambda$
are done by theoretically taking partials of each. In reality, the estimates can be solved for numerically. Choosing which numerical method is a whole other conversation.
An Example
The data are housing prices from the AER library. We have one response “price” and eleven predictors. Some of predictors are factors. The box-cox procedure produces:
## [1] 0.1414141
Based on likelihood profile plot above, the estimate for lambda is $0.1414$
. Now, I will check visually whether the transformation produced a better model.
A plot of the residuals look better and so does the normal plot.