Why do the degrees of freedom in linear regression equal n - p?

If we run a linear regression in R, the summary report includes a term called "degrees of freedom", and that value equals $n - p$, where $n$ is the number of observations (data points) and $p$ is the number of parameters (including the intercept).
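As a minimal sketch (using R's built-in `mtcars` data, which is only an illustration and not part of the original discussion), here is where that number shows up:

```r
# Fit a small regression and read off the residual degrees of freedom R reports.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # n = 32 observations, p = 3 parameters (intercept, wt, hp)
summary(fit)                             # prints "... on 29 degrees of freedom", i.e. n - p = 32 - 3
df.residual(fit)                         # 29
```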

But what does it mean? How is it useful?

Short answer: the degrees of freedom is the number of "free", or independent, squared standard normal terms ($N(0,1)^2$) hiding in the residuals of the regression. The following parts explain this further.

1. Measuring $\sigma^2$

Consider the Gauss-Markov model, $y_i = x_i^T\beta + \epsilon_i$, where the $\epsilon_i$ are iid errors centered at zero with variance $\sigma^2$. Here, $x_i$ and $\beta$ are assumed fixed (non-random); the $\epsilon_i$ are the only source of randomness, which is what makes $y$ random.

With the widely known formula $\hat\beta = (X^TX)^{-1}X^TY$, one can compute the BLUE estimator $\hat\beta$ of $\beta$ and obtain the estimated residuals $\hat\epsilon = Y - X\hat\beta$. Then, how can we estimate $\sigma^2$? We have RSS, the residual sum of squares, but what's next?
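For concreteness, here is a small sketch of that closed-form computation on simulated data (the dimensions, seed, and coefficient values below are made up for illustration):

```r
# Simulate a Gauss-Markov model y = X beta + eps and compute the OLS quantities by hand.
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, rnorm(n), rnorm(n))           # fixed design matrix with an intercept column
beta <- c(2, 1, -0.5)
y <- X %*% beta + rnorm(n, sd = 2)          # true sigma = 2

beta_hat  <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
resid_hat <- y - X %*% beta_hat             # estimated residuals
RSS <- sum(resid_hat^2)                     # residual sum of squares
```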

A short answer is: $E\!\left(\frac{RSS}{n-p}\right) = \sigma^2$. Here, the denominator is the degrees of freedom. The proof follows from the fact that $\mathrm{cov}(\hat\epsilon) = \sigma^2(I_n - H)$, where $H = X(X^TX)^{-1}X^T$ is the projection (hat) matrix, together with the fact that the error terms are centered at zero.

To make the proof easier, let us further assume the error terms are normally distributed. Then $RSS = \sum_{i=1}^{n}\hat\epsilon_i^2$ behaves like a chi-square random variable. Recall that a chi-square distribution with $l$ degrees of freedom is the distribution of a sum of $l$ independent squared standard normals.

But the two are not exactly the same. In fact, $\frac{RSS}{\sigma^2} \sim \chi^2_{n-p}$, because the trace of $I_n - H$ is $n - p$: since $I_n - H$ is idempotent, its eigenvalues are all 0 or 1, and exactly $n - p$ of them equal 1, which is exactly the number of independent normal variables left in the residuals. Since $E(\chi^2_{n-p}) = n - p$, it follows that $E\!\left(\frac{RSS}{n-p}\right) = \sigma^2$.
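A quick simulation can illustrate both claims (sample size, seed, and coefficients below are arbitrary choices for the sketch): with normal errors, $RSS/\sigma^2$ behaves like a $\chi^2_{n-p}$ variable, so $RSS/(n-p)$ averages out to $\sigma^2$.

```r
# Repeatedly simulate the same fixed-design model and collect the RSS each time.
set.seed(2)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, rnorm(n), rnorm(n))           # fixed design, as the Gauss-Markov model assumes
beta <- c(1, 2, 3)
rss <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  sum(resid(lm(y ~ X - 1))^2)               # RSS (X already contains an intercept column)
})
mean(rss / (n - p))                         # close to sigma^2 = 4
mean(rss / sigma^2)                         # close to n - p = 47, the mean of a chi-square_{n-p}
```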

2. Hypothesis Testing for $\hat\beta$

$$\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2\,[(X^TX)^{-1}]_{jj}}} \sim t_{n-p}$$

So $n - p$ is also the degrees of freedom of the $t$ distribution used for testing whether a coefficient $\beta_j$ is zero. Recall that a $t$ distribution with $n - p$ degrees of freedom is $t_{n-p} = \frac{N(0,1)}{\sqrt{\chi^2_{n-p}/(n-p)}}$. The $n - p$ comes from the same chi-square distribution we mentioned above.
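As a sketch of how these pieces fit together (again on `mtcars`, purely for illustration), the t statistic that `summary()` reports can be rebuilt by hand from $\hat\sigma^2$ and $(X^TX)^{-1}$:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- length(coef(fit))
sigma2_hat <- sum(resid(fit)^2) / (n - p)            # RSS / (n - p)
XtX_inv    <- summary(fit)$cov.unscaled              # (X'X)^{-1}
se_wt <- sqrt(sigma2_hat * XtX_inv["wt", "wt"])      # standard error of the wt coefficient
t_wt  <- coef(fit)["wt"] / se_wt                     # t statistic for H0: beta_wt = 0
2 * pt(abs(t_wt), df = n - p, lower.tail = FALSE)    # two-sided p-value, matches summary(fit)
```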

But this relies on the assumption that the error terms are homoscedastic. How do we estimate the standard error of $\hat\beta$ under heteroscedasticity? That will be covered in future posts about EHW (Eicker-Huber-White) standard errors and WLS.

3. Connection to 1-sample t-test

Let us first assume that we have shifted $Y$ by a null-hypothesized mean, $y_i' = y_i - Y_0$ for all $i$. Then, if the linear model is $Y \sim 1$, which includes only the intercept term, $\hat\beta = \bar{Y}$, the sample average of $Y$. In that case $RSS/(n-1)$ is exactly the sample variance of $Y$, which gives another way to see the $n - 1$ in the denominator of the sample variance. By Part 2, the sample mean divided by its estimated standard error, $\sqrt{\hat\sigma^2/n}$, follows a $t$ distribution with $n - 1$ degrees of freedom.
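A quick check of this equivalence (with made-up data, assuming the null value $Y_0 = 0$ so no shifting is needed):

```r
set.seed(3)
y <- rnorm(20, mean = 1)
t.test(y)                                    # one-sample t test of H0: mean = 0, df = n - 1 = 19
summary(lm(y ~ 1))                           # intercept = sample mean, same t value, 19 df
c(var(y), sum(resid(lm(y ~ 1))^2) / 19)      # sample variance equals RSS / (n - 1)
```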

4. Connection to 2-sample t-test

Let us assume that the linear model is $Y \sim 1 + X$, which includes the intercept term and a binary variable $X$ that only takes the values 0 or 1: $X = 0$ for the control group and $X = 1$ for the treatment group. We want to test whether the average treatment effect, or difference in means, is significantly different from zero. If we run the regression, the coefficient on $X$ is the difference-in-means estimator (the difference between the two sample means). Its standard error is the same as that of a pooled (equal-variance) two-sample t-test, as you can verify algebraically from the equation above. Furthermore, the degrees of freedom is $n - 2$, since $p = 2$ in this case.
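The same check for the two-sample case (simulated groups, purely illustrative); note that the regression reproduces the pooled, equal-variance t test:

```r
set.seed(4)
x <- rep(c(0, 1), each = 15)                 # 0 = control, 1 = treatment
y <- 2 + 1 * x + rnorm(30)                   # true treatment effect = 1
t.test(y ~ x, var.equal = TRUE)              # pooled two-sample t test, df = n - 2 = 28
summary(lm(y ~ x))                           # slope = diff-in-means, same t value (up to sign), 28 df
```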

Jiayang Nie
Machine Learning Engineer Intern