Mainly based on Wooldridge (2019), Chapter 4

3.1 Topics of this chapter

This chapter deals with statistical inference in the regression model, particularly with

  • Hypothesis tests about population parameters

  • Construction of confidence intervals

For both we need the distribution of the OLS estimators for repeated samples

  • The OLS estimators \hat \beta_j and \hat \sigma^2 depend on the particular sample and are therefore random variables

  • We already know their expected values and their variances

  • However, for hypothesis tests we need to know their distribution

  • In order to derive their distribution we need additional assumptions, at least for small samples

    • MLR.6: Errors are normally distributed, u_i \ \sim \ N(0, \sigma^2), which implies that the dependent variable y is normally distributed as well, conditional on the values of the explanatory variables in \mathbf x

y_i \, | \, \mathbf x \ \sim \ N(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \, \sigma^2)


3.2 Statistical inference in general

Discussion of the normality assumption

  • The error term u_i is interpreted as the sum of many different unobserved factors

  • According to the Central Limit Theorem (CLT, see Theorem A.3), sums (means) of many independent factors tend to be normally distributed

  • Problems:

    • Are the conditions for the CLT met? In particular:

      • How many different factors? Is the number large enough?

      • Possibly very heterogeneous distributions of individual factors

      • How independent are the different factors from each other?

The normality of the error term is an empirical question

  • At least the error distribution should be “close” to normal

  • In many cases, normality is questionable or even impossible by definition

    • Examples where normality cannot strictly hold:

      • Wages (non-negative; also: minimum wage)

      • Number of arrests (takes on a small number of integer values)

      • Employed/not employed (an indicator or binary dependent variable, which takes on only 0 or 1)

  • In some cases, normality can be achieved through transformations of the dependent variable (e.g., use log(wage) instead of wage; see the sketch below)
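
The following quick visual check is only a sketch, using the wage1 data from the wooldridge package that is also used later in this chapter: raw wages are heavily right-skewed, while log(wage) is much closer to a symmetric, bell-shaped distribution (strictly, MLR.6 concerns the error term, but the shape of the dependent variable already suggests why the log transformation helps)

library(wooldridge);  data("wage1")

# side-by-side histograms: raw wages versus log wages
par(mfrow = c(1, 2))
hist(wage1$wage,  breaks = 30, main = "wage",      xlab = "")
hist(wage1$lwage, breaks = 30, main = "log(wage)", xlab = "")
par(mfrow = c(1, 1))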

Very important: Under certain regularity conditions regarding the data and u_i, the assumption of normality can be replaced by a large sample size. In that case, the error term need not be normally distributed.


The following terminology is often used:

\underbrace{\text{MLR.1 - MLR.5}}_{\text{Gauss-Markov assumptions}} \quad \quad \underbrace{\text{MLR.1 - MLR.6}}_{\text{Classical linear model (CLM) assumptions}}


Furthermore, we have the following important theorem:

Theorem 3.1 (Normal sampling distributions) Under assumptions MLR.1 - MLR.6

\hat \beta_j \ \sim \ N \left( \beta_j, \operatorname{Var}(\hat \beta_j) \right) \tag{3.1}

That means that, given the assumptions above, the OLS estimator \hat \beta_j is unbiased and normally distributed around the true parameter with a variance that was derived earlier.

Additionally:

\dfrac {\hat \beta_j - \beta_j}{sd(\hat \beta_j)} \ \sim \ N(0,1) \tag{3.2}

This means that the standardized estimator follows a standard normal distribution
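
A small Monte Carlo sketch (with made-up parameter values) illustrates this: we repeatedly draw samples from a simple regression model with normal errors, standardize the slope estimate with its true standard deviation, and compare the resulting distribution with N(0,1)

set.seed(123)
n <- 50;  reps <- 2000
z <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n);  u <- rnorm(n, sd = 2)   # normal errors with sigma = 2
  y <- 1 + 0.5*x + u                      # true beta_0 = 1, beta_1 = 0.5
  b1 <- coef( lm(y ~ x) )[2]
  # standardize with the true sd(beta_1_hat) = sigma / sqrt(SST_x)
  z[r] <- (b1 - 0.5) / ( 2 / sqrt( sum( (x - mean(x))^2 ) ) )
}
hist(z, breaks = 40, freq = FALSE, main = "Standardized slope estimates and the N(0,1) density")
curve(dnorm(x), add = TRUE, lwd = 2)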


The proof is quite straightforward if we use matrix notation, see Section 2.4.1

The model in matrix notation is:

\mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

And the OLS estimator for \boldsymbol{\hat \beta} is (see Equation C.2):

\hat {\boldsymbol \beta} = (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf y

This can be written as (Equation C.3):

\hat {\boldsymbol \beta} = \boldsymbol \beta + \underbrace {(\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf u}_{\mathbf {Au}} \tag{3.9}

  • The typical element of the vector \mathbf {Au} is \sum_{j=1}^n a_{ij} \, u_j

    • Hence, the OLS estimator \hat {\boldsymbol \beta} is a linear combination of the normally distributed errors u_{j}

    • According to the Reproductive Property of the normal distribution, \hat {\boldsymbol \beta} therefore is normally distributed as well:

\hat {\boldsymbol \beta} \, | \, \mathbf X \ \sim \ N \left( \boldsymbol \beta, \ \sigma^2 (\mathbf X' \mathbf X)^{-1} \right)

  • Here, we used the fact that under MLR.1 - MLR.4, \hat {\boldsymbol \beta} is unbiased

  • For the derivation of the covariance matrix of \hat {\boldsymbol \beta} we used MLR.5 and formula Equation C.7
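
As a quick check of this derivation (a sketch using the wage1 data that appears later in this chapter), we can compute (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf y by hand and compare it with the coefficients returned by lm()

library(wooldridge);  data("wage1")

X <- cbind(1, wage1$educ, wage1$tenure, wage1$exper)     # design matrix including the intercept
y <- wage1$lwage

solve( t(X) %*% X ) %*% t(X) %*% y                       # (X'X)^{-1} X'y
coef( lm(lwage ~ educ + tenure + exper, data = wage1) )  # the same coefficients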


3.3 Testing hypotheses about a single population parameter

We have the following theorem:

Theorem 3.2 (t-distribution for the standardized estimators) Given MLR.1 - MLR.6, we state:

\dfrac {\hat \beta_j - \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.3}

  • That means that if the standardization in Equation 3.2 is done using the estimated standard deviation of \hat \beta_j, the normal distribution from Equation 3.2 is replaced by a t-distribution

    • The estimated standard deviation of \hat \beta_j is called standard error

    • The standard error, se(\hat \beta_j), is calculated with \hat \sigma^2 instead of the true \sigma^2 in Equation 2.37. Therefore, se(\hat \beta_j) is itself a random variable; the underlying estimator \hat \sigma^2 is (after scaling with \sigma^2) \chi^2 distributed

  • In this case, we actually have a ratio of two random variables in Equation 3.3 – a standard normal over the square root of an (independent) \chi^2 distributed random variable divided by its degrees of freedom – which leads to a t-distribution. That’s our heuristic proof of Theorem 3.2

  • A very important special case arises if we test the particular null hypothesis H_0 : \beta_j=0

    • This means that we hypothesize that the true population parameter is equal to zero, i.e. there is no effect of x_j on y after controlling for the other independent variables

    • This is the so-called significance test. In this case, presuming the H_0 is true, Equation 3.3 reduces to the t-statistic

t_{\hat \beta_j} \ \equiv \ \dfrac {\hat \beta_j - 0}{se(\hat \beta_j)} \ = \ \dfrac {\hat \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.4}

  • If we, more generally, hypothesize that \beta_j = a_j, the test statistic is (presuming the H_0 is true)

\dfrac {\hat \beta_j-a_j} {se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.5}

Generally, the t-statistic is a standardized measure of the distance between the estimated parameter and its hypothesized value (which is zero or a_j)

In particular, it measures how many estimated standard deviations the estimated coefficient is away from the H_0 (0 or a_j)

  • The basic test idea is the following: The farther the estimated coefficient is away from the hypothesized value zero (or a_j), the less likely it is that the null hypothesis holds true

    • But what does “far away” from zero (or a_j) mean? This is measured by the t-statistic above and assessed with the help of the distribution of the t-statistic

    • If the H_0 is true, we know from Theorem 3.2 that the distribution of the t-statistic is the t-distribution (so the name of the t-statistic is no accident)

      • If the degrees of freedom of the t-distribution are greater than 60 (many observations), the t-distribution is practically identical to the standard normal N(0,1) distribution
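
A quick numerical check of this claim: the 0.975 quantile of the t-distribution approaches the corresponding N(0,1) quantile of about 1.96 as the degrees of freedom grow

dof <- c(5, 30, 60, 120, 500)
rbind(dof    = dof,
      q_t    = round( qt(0.975, dof), 3 ),                    # 0.975 quantile of t(dof)
      q_norm = rep( round( qnorm(0.975), 3 ), length(dof) ))  # 0.975 quantile of N(0,1)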

3.3.1 Testing procedure for a single parameter

How is a test procedure actually carried out?

  1. From Theorem 3.2, we know the distribution of the test statistic

  2. We carry out the estimation of \beta_j, obtaining \hat \beta_j and se(\hat \beta_j), and calculate the test statistic from Equation 3.5

  3. We want to test a hypothesis H_0 about the true value of the estimated parameter – actually we want to reject the H_0

  4. The test statistic shows how far away the estimate \hat \beta_j is from the hypothesized value

  5. To assess whether the test statistic is far enough away to reject the H_0, we construct a rejection region so that, if the null hypothesis is true, the test statistic falls into this rejection region only accidentally, for instance in only 5% of the cases. This small probability is freely chosen and is called the significance level

    1. The significance level \alpha is the probability of making a so-called type I error or \alpha-error, which is the probability of falsely rejecting a true null hypothesis (only due to sampling errors). In most cases \alpha=0.05 (5%) is chosen, but \alpha=0.01 (1%) or \alpha=0.1 (10%) are also quite common

    2. The boundary points of the rejection region are called critical values. These are the values that the test statistic exceeds (or falls below) only with a probability equal to the significance level if the H_0 is true. Formally, we are looking for the corresponding quantiles of the test statistic’s distribution

  6. Having constructed the rejection region of the test, we reject the null hypothesis H_0: \beta_j=a_j in favor of the alternative hypothesis if the test statistic falls into this rejection region. In this case the estimated value is “too far away” from the hypothesized value

  7. There are two types of alternative hypotheses H_1:

    1. H_1: \beta_j \neq a_j for a two sided or two-tailed test

    2. either H_1: \beta_j>a_j or H_1: \beta_j<a_j for a one sided or one-tailed test

    3. For one-sided tests, it is sometimes not obvious whether one should apply a left-sided or a right-sided test. There is a simple rule: if the t-statistic is negative, we can only apply a left-sided test, and if the t-statistic is positive, we can only apply a right-sided test; otherwise, the H_0 could never be rejected

  8. Two-tailed tests are much more common. One-tailed tests are used if the researcher is only interested in a particular “direction” of the parameter, e.g., with a unit-root test (a particular test for the stationarity of time series) one is only interested in whether the parameter of interest might be \geq 0


Code
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
cvl <- round( qnorm(0.95), digits = 2) # critical value as quantile of the norm. distrib.

plot(x = t, y = dnorm(t),  type = "l",  col = "black",  lwd = 2, 
     main = expression("One-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "≤ 0, H"[1]: ~ beta ~" > 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 


# critical region in right tail
polygon(x = c(cvl, seq(cvl, 5, 0.01), 5),
        y = c( 0, dnorm(seq(cvl, 5, 0.01)), 0 ), 
        col = 'cornflowerblue')


# text
text(3.5, 0.15, labels = expression( "Rejection region," ~ alpha ~ "= 0.05"),cex = 1)
arrows(3.6, 0.133, 2.3, 0.04, length = 0.08)

text(0, 0.1, labels = bquote( qn[0.95] ~ " = " ~ .(cvl)), cex = 1)
arrows(0, 0.085, cvl-0.03, 0.002, length = 0.08)

text(-3.5, 0.3, labels = expression("Distribution of the test statistic"),cex = 1)
arrows(-3.6, 0.28, -1.22, 0.2, length = 0.08)

Figure 3.1: Distribution of the t-statistic - in this case a normal distribution

Code
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qnorm(0.025), digits = 2 ); cvr <- round( qnorm(0.975), digits = 2)

plot(x = t, y = dnorm(t),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dnorm(seq(-5, cvl, 0.01)), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dnorm(seq(cvr, 5, 0.01)), 0 ), 
        col = 'cornflowerblue')


text(-4, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(-4, 0.13, -2.5, 0.026, length = 0.08)

text(4, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(4, 0.13, 2.5, 0.026, length = 0.08)

text(-0.5, 0.15, labels = bquote( qn[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-0.5, 0.135, cvl+0.03, 0.002, length = 0.08)

text(0.5, 0.1, labels = bquote( qn[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.5, 0.085, cvr-0.03, 0.002, length = 0.08) 

Figure 3.2: Distribution of the t-statistic - in this case a normal distribution

Code
# Plot the t(4) Distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qt(0.025, 4), digits = 2 ); cvr <- round( qt(0.975, 4), digits = 2)

plot(x = t, y = dt(t, 4),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), 
        col = 'cornflowerblue')


text(-4.5, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(-4.5, 0.13, -3.8, 0.02, length = 0.08)

text(4.5, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(4.5, 0.13, 3.8, 0.02, length = 0.08)

text(-2.2, 0.18, labels = bquote( qt[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl+0.03, 0.0015, length = 0.08)

text(0.55, 0.09, labels = bquote( qt[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr-0.03, 0.002, length = 0.08) 

Figure 3.3: Distribution of the t-statistic - in this case a t-distribution
Code
# Plot the t(4) Distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qt(0.01, 4), digits = 2 ); cvr <- round( qt(0.99, 4), digits = 2)

plot(x = t, y = dt(t, 4),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.02; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), 
        col = 'cornflowerblue')


text(-4.5, 0.15, labels = expression("0.01"~"="~over(alpha, 2)),cex = 1)
arrows(-4.5, 0.13, -4.1, 0.015, length = 0.08)

text(4.5, 0.15, labels = expression("0.01"~"="~over(alpha, 2)),cex = 1)
arrows(4.5, 0.13, 4.1, 0.015, length = 0.08)

text(-2.2, 0.18, labels = bquote( qt[0.01] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl+0.03, 0.0015, length = 0.08)

text(0.55, 0.09, labels = bquote( qt[0.99] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr-0.03, 0.002, length = 0.08) 

Figure 3.4: Distribution of the t-statistic - in this case a t-distribution

3.3.2 Example for tests - wage equation

We first estimate the model – in this case a model explaining individuals’ wages
library(AER);  library(wooldridge);  data("wage1")

output <- summary( lm(lwage ~ educ + tenure + exper, data = wage1)  )
output
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

We want to test whether tenure has any effect on lwage
  • Significance Test: H0: b3 = 0, H1: b3 ≠ 0, two-tailed

  • Choosing the significance level alpha = 5%

alpha <- 0.05
# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# "the" t-statistic:  b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
      [1] 7.13307
# the 2.5% and 97.5% quantile of the standardized normal distribution: rejection region in both tails

# Rule of Thumb for two-tailed tests: reject the H0 if |t-statistic| ≥ 2
c( qnorm(alpha/2), qnorm(1-alpha/2) )
      [1] -1.959964  1.959964

The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.

The critical values are about \pm2.

Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small probability of a type I error, of course.


We want to test whether tenure really has a positive effect on lwage
  • One sided test: H0: b3 ≤ 0, H1: b3 > 0 – right-tailed

  • We choose a significance level alpha = 5%

alpha <- 0.05

# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# the t-statistic: b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
      [1] 7.13307
# the 95% quantile of the standardized normal distribution: rejection region only in the right tail
qnorm(1-alpha)
      [1] 1.644854

The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.

The critical value is about 1.64.

Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no or a negative effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small probability of a type I error, of course.


We want to test whether educ has an effect on lwage of at least 10%:
  • H0: b2 ≥ 0.1, H1: b2 < 0.1 – left-tailed (because estimated coefficient is less than 0.1)

  • Choosing the significance level alpha = 5%

alpha <- 0.05

# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# the t-statistic: (b2 - 0.1) / sd(b2) 
( coef(output)[2,1] - 0.1 ) / coef(output)[2,2]
      [1] -1.087462
# the 5% quantile of the standardized normal distribution: rejection region in the left tail
qnorm(alpha)
      [1] -1.644854

The t-statistic does not fall into the rejection region.

Therefore, H_0 is not rejected – there is not enough evidence in favor of H_1

However, H_1 (b2 < 0.1) might be true anyway – we might be committing a type II error


3.4 Types of errors and power of statistical tests

  • There are principally two types of errors in statistical tests

    • Type I error or \alpha-error

      • This is the probability of falsely rejecting a true H_0 due to sampling errors. This probability is chosen by the researcher via the significance level. So, why not choose a very low significance level \alpha to minimize this error? In that case, rejecting the H_0 would give us more confidence in the rejection. But the downside is that a low significance level increases another type of error
    • Type II error or \beta-error

      • This is the probability of accepting a false H_0. Looking at Figure 3.3 and Figure 3.4 it is pretty clear that reducing the significance level, for instance from \alpha= 0.05 to 0.02, shrinks the rejection region and therefore enlarges the acceptance region. In this case, the probability of rejecting the H_0 in repeated samples is decreased even if the H_0 is false
    • The power of a test is the probability of rejecting a false H_0. Hence, this is one minus the probability of making a type II error and one of the most important and desirable properties of a test – it is the ability of the test to discriminate between true and false hypotheses

      • Besides the significance level and the standardized distance of the hypothesized value from the estimated one, all factors which reduce the standard errors of the estimates increase the power of a test; most notably a large number of observations and low multicollinearity, i.e., large variation of the explanatory variables (a small simulation sketch follows below)
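
The following small simulation is only a sketch with made-up values: it estimates the power of a 5% significance test of the (false) H_0: \beta_1=0 when the true slope is 0.2, for several sample sizes

set.seed(42)
power_sim <- function(n, beta1 = 0.2, reps = 1000, alpha = 0.05) {
  rej <- replicate(reps, {
    x <- rnorm(n);  u <- rnorm(n)
    y <- 0.5 + beta1*x + u
    summary( lm(y ~ x) )$coefficients[2, 4] < alpha   # TRUE if H0: beta1 = 0 is rejected
  })
  mean(rej)   # share of rejections = estimated power
}
sapply(c(25, 50, 100, 200), power_sim)   # power increases with the number of observations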

3.5 p-value

The classical test strategy of

  1. choosing a significance level

  2. calculating the test statistic

  3. looking whether the test statistic is within the rejection region
    (i.e. whether |test statistic| ≥ |critical value|)

is somewhat arbitrary as the result of the test depends on the chosen significance level

  • As an example, suppose the observed t-statistic for a significance test, i.e. H_0: \beta_j=0 versus H_1: \beta_j≠0, equals 1.85

    • Using a significance level of 0.05, the H_0 cannot be rejected – the corresponding critical values, q_{0.025}, q_{0.975}, are about ±2

    • However, using a significance level of 0.1, the H_0 can be rejected – the corresponding critical values, q_{0.05}, q_{0.95}, are about ±1.65

    • So, what conclusions should be drawn?


  • The concept of the p-value provides an alternative to this and answers the following question: given the observed t-statistic, what is the smallest significance level at which the H_0 would be rejected?

  • This has the following interpretation:

    • The p-value is the probability of observing a t-statistic as large as the one we actually did (or even larger in absolute terms) if the H_0 is true

    • Hence, a small p-value is evidence against the H_0, as the actually observed (large) t-statistic is then very unlikely to be merely a result of sampling errors, and vice versa

    • Alternatively: A small p-value is evidence against the null hypothesis because one can reject the null hypothesis even at small significance levels (small error probability)

  • To calculate a p-value by hand is seldom necessary (the econometric package usually does the work). It could be done the following way, which, in a sense, is a reversal of the classical test strategy

    • Construct the rejection region, using ± the observed t-statistic as critical values

    • Calculate the implied significance level. This is done with the cdf() (cumulative distribution function). For our example, the implied significance level is: \operatorname {cdf}(-1.85) + (1 - \operatorname {cdf}(1.85))=0.072.
      The R code for an assumed t_{40} distribution is (1 - pt(abs(1.85), 40)) * 2 = 0.072 (see the short snippet after this list)

  • The following Figure 3.5 illustrates the concept of the p-value
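
As a standalone check (a sketch reproducing the number given above), the two-sided p-value for an observed t-statistic of 1.85 with 40 degrees of freedom can be computed directly with pt()

tstat <- 1.85
dof   <- 40
2 * (1 - pt(abs(tstat), dof))   # approx. 0.072; equivalently: 2 * pt(-abs(tstat), dof)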


Code
# Plot of prob.value

b=0    # beta hat
sd=1  # se(beta hat)
t=1.85
df=40
a=round(  (1-pt(abs(t), df))*2, digits = 3 ) # prob value
#a=0.036   # significance level#

beg=round(-4)
end=round(4)
maxh=dnorm(0)

x <- seq(beg, end, 0.01)
cvl <- -t
cvr <- t

plot(x = x, y = dt(x, df),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( "Observed t-statistic = 1.85," ~  t[40] ~ "distrib. => p-Value =  [1 - pt(| t |, 40)]*2 = " ~ .(round( a, digits=3))),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl, 0.01), cvl),
        y = c( 0, dt(seq(beg, cvl, 0.01), df), 0 ),
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, end, 0.01), end),
        y = c( 0, dt(seq(cvr, end, 0.01), df), 0 ),
        col = 'cornflowerblue')



# left
text(beg+1, maxh*0.3, labels = bquote( over(alpha,2) ~ " =  1 - pt(| t |, 40)  = " ~ .(a/2) ), cex = 1)
arrows(beg+1, maxh*0.26, cvl*1.18  , maxh*0.045, length = 0.08)
# right
text(end-1, maxh*0.3, labels = bquote( over(alpha,2) ~ " =  1 - pt(| t |, 40) = " ~ .(a/2) ), cex = 1)
arrows(end-1, maxh*0.26, cvr*1.18, maxh*0.045, length = 0.08)
# acceptance 
text(end-1, maxh*0.85, labels = bquote( "(1 -" ~ alpha ~ ") = " ~ .(1-a) ), cex = 1)
arrows(end-1, maxh*0.81, end*0.1, maxh*0.6, length = 0.08)


# left
text( -0.8, maxh*0.35, labels = bquote( .(-t)), cex = 1)
arrows(-0.8, maxh*0.31, -t, maxh*0.003, length = 0.08)
# right
text(0.8, maxh*0.35, labels = bquote( .(t)), cex = 1)
arrows(0.8, maxh*0.31, t, maxh*0.003, length = 0.08)

text(-2.8, 0.3, labels = bquote("Distribution of the test statistic," ~ t[40]),cex = 1)
arrows(-2.8, 0.28, -1.12, 0.22, length = 0.08)

Figure 3.5: Distribution of the t-statistic

3.6 General remarks on parameter testing

Guidelines for discussing economic and statistical significance

  • All statistical tests described in this chapter require at least MLR.1 - MLR.5 (for small samples, we additionally need MLR.6). Violations of some assumptions need special treatments

  • If a variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its economic or practical importance

  • The fact that a coefficient is statistically significant does not necessarily mean it is economically or practically significant!

    • This is especially true for data sets with very many observations. For instance, if you have thousands of observations in a cross section data set, nearly every effect is statistically significant, even the most trivial ones
  • If a variable is statistically and economically important but has the “wrong” sign, the regression model might be misspecified

  • If a variable is statistically insignificant at the usual levels (10%, 5%, or 1%), one may think of dropping it from the regression

    • However, if the sample size is small, effects might be imprecisely estimated so that the case for dropping insignificant variables is less strong

3.7 Confidence intervals

Until now, we have dealt with point estimates of the unknown parameters. But according to Theorem 3.1, we know that

{\hat \beta_j} \ \sim \ N\left(\beta_j, \, sd(\hat \beta_j)^2\right)

Now, imagine we want to construct an interval estimate of the true parameter \beta_j which covers the true parameter with a probability of, say, 95% (provided our model is correctly specified, of course)

  • To do so, we need the 0.025 and 0.975 quantiles of this distribution. It turns out that these quantiles are simply the ones of the standardized normal distribution – or of the t-distribution, if \sigma^2 is unknown and has to be estimated with only a few observations – times the estimated standard deviation of \hat \beta_j, se(\hat \beta_j)

  • This argument leads to the so-called 95% confidence interval, which covers the true parameter with a probability of 95%. As the distribution at hand is symmetric, we have q_{0.025}=-q_{0.975} and therefore

\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ q_{0.975} \times se(\hat \beta_j) \right] \tag{3.6}

This way, we heuristically constructed an interval estimate for \beta_j, considering sample variability. Thereby, q_{0.975} denotes the 0.975 quantile of a standardized normal or a t_{n-k-1} distribution (for fewer than 60 degrees of freedom)


  • As the 0.975 quantile is about 2 in most cases (except for very low degrees of freedom), we can approximate the 95% confidence interval by a very simple formula, which is often used in practice

\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ 2 \times se(\hat \beta_j) \right] \tag{3.7}

  • For a 99% confidence interval, use 2.6, and for a 90% confidence interval use 1.65 instead of 2. These approximations work very well for more than 60 degrees of freedom

  • For a more formal derivation of confidence intervals, we start with Theorem 3.2 which implies:

P\left(-q_{0.975} \ < \ \dfrac{\hat{\beta}_{j}- \beta_j }{ se(\hat \beta_j)} \ < \ q_{0.975}\right)=0.95 \ \ \Rightarrow\left[\times se(\hat \beta_j)\right] \\ P\left(-q_{0.975} \times se(\hat \beta_j) \ < \ \hat{\beta}_{j}- \beta_j \ < \ q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow\left[-\hat{\beta}_{j}\right] \\ P\left(-q_{0.975} \times se(\hat \beta_j)-\hat{\beta}_{j} \ < \ -\beta_j \ < \ -\hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow[ \times -1] \\ P\left(q_{0.975} \times se(\hat \beta_j)+\hat{\beta}_{j} \ > \ \beta_j > \ \hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow

P\left(\hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j) \ < \ \beta_{j} \ < \ \hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right) = 0.95
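
As a sketch (assuming the summary object output of the wage model from Section 3.3.2 is still in memory), the 95% confidence interval for the tenure coefficient can be computed by hand and compared with the confint() results shown in Section 3.7.2

q  <- qt(0.975, df = 522)    # 0.975 quantile of the t-distribution with n-k-1 = 522 df
b  <- coef(output)[3, 1]     # estimated coefficient of tenure
se <- coef(output)[3, 2]     # its standard error
c(b - q*se, b + q*se)        # approx. [0.016, 0.028], cf. confint() in Section 3.7.2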


Code
# Plot of 95% confidence interval

b=2.3    # beta hat
sd=0.96  # se(beta hat)
a=0.05   # significance level#

beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)

x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits = 3 )
cvr <- round( qnorm(1-a/2, 0, 1), digits = 3)

plot(x = x, y = dnorm(x, b, sd),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( .((1-a)*100) ~ "% Confidence Interval; e.g.," ~  hat(beta) ~ "=" ~ .(b) ~ ", " ~ se(hat(beta)) ~ "=" ~ .(sd)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y = c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y = c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')


# confidence region
polygon(x = c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y = c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), 
        col = 'gray99')


# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col = "gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize  = 0.08, lwd = 6, col = "darkgray")
rug( b, ticksize  = 0.05, lwd = 1.5)


# left
text(beg, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18  , maxh*0.02, length = 0.08)
# right
text(end, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length = 0.08)

text(end, maxh*0.85, labels = bquote( .(1-a) ), cex = 1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length = 0.08)



# left
text( beg+(b-beg)/4, maxh*0.45, labels = bquote( hat(beta) ~ "-" ~.(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b-abs(cvl)*sd)), cex = 1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length = 0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels = bquote( hat(beta) ~ "+" ~ .(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b+cvr*sd)), cex = 1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length = 0.08)


# beta hat
text(b-0.1, maxh*0.2, labels = bquote(hat(beta) ~ "=" ~ .(b)), cex = 1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length = 0.08) 

text(-1.0, 0.35, labels = expression("Estimated distribution of" ~ hat(beta)),cex = 1)
arrows(-1.1, 0.33, 1.11, 0.2, length = 0.08)

Figure 3.6: Confidence Interval

Code
# Plot of 99% confidence interval

b=2.3    # beta hat
sd=0.96  # se(beta hat)
a=0.01   # significance level#

beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)

x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits = 3)
cvr <- round( qnorm(1-a/2, 0, 1), digits = 3)

plot(x = x, y = dnorm(x, b, sd),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( .((1-a)*100) ~ "% Confidence Interval; e.g.," ~  hat(beta) ~ "=" ~ .(b) ~ ", " ~ se(hat(beta)) ~ "=" ~ .(sd)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y = c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y = c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')


# confidence region
polygon(x = c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y = c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), 
        col = 'gray99')


# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col = "gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize  = 0.08, lwd = 6, col = "darkgray")
rug( b, ticksize  = 0.05, lwd = 1.5)


# left
text(beg, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18  , maxh*0.02, length = 0.08)
# right
text(end, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length = 0.08)

text(end, maxh*0.85, labels = bquote( .(1-a) ), cex = 1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length = 0.08)



# left
text( beg+(b-beg)/4, maxh*0.45, labels = bquote( hat(beta) ~ "-" ~.(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b-abs(cvl)*sd)), cex = 1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length = 0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels = bquote( hat(beta) ~ "+" ~ .(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b+cvr*sd)), cex = 1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length = 0.08)


# beta hat
text(b-0.1, maxh*0.2, labels = bquote(hat(beta) ~ "=" ~ .(b)), cex = 1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length = 0.08) 

text(-1.0, 0.35, labels = expression("Estimated distribution of" ~ hat(beta)),cex = 1)
arrows(-1.1, 0.33, 1.11, 0.2, length = 0.08)

Figure 3.7: Confidence Interval

3.7.1 Interpretation of the confidence interval

  • A confidence interval is constructed in the way that in repeated samples it will cover the population regression coefficient (the true value) in 95% of the cases

  • Hence, the bounds of a confidence interval are random

  • We further have an important relationship between confidence intervals and hypotheses tests:

a_j \notin interval \ \Rightarrow \ \text{ reject } H_0: \beta_j=a_j \, \text{ in favour of } H_1: \beta_j \neq a_j

  • This means that if a specific value is not in the confidence interval, we can reject the null hypothesis in a two-tailed test that \beta_j is equal to this value

  • Hence, calculating confidence intervals is an alternative way to carry out parameter tests

    • Likewise, this equivalence to parameter tests implies that violations of assumptions MLR.1 – MLR.5 (MLR.6) may lead to invalid confidence intervals

Simulating Confidence Intervals

To demonstrate the interpretation of confidence intervals from above, we simulate the following simple regression model

y = 3 + 1 \times x + u

with a sample size of n=100, and calculate the estimate for \beta_1 – with a true value of one – and the corresponding estimated 95% confidence interval with

lower \ bound \ = \ \hat \beta_1 - q_{0.975} \times se(\hat \beta_1) \\ \ upper \ bound \ = \ \hat \beta_1 + \underbrace {q_{0.975}}_{1.984} \times se(\hat \beta_1)

This whole procedure is repeated 500 times. We plot the first 100 of the calculated 500 confidence intervals and look how many of them cover the true value of \beta_1=1

  • Result: Of all 500 simulated intervals, 94.4% cover the true value of one. Of the first 100 plotted, 95 cover the true value and 5 do not. This is in nearly perfect accordance with expectations
Code
library(AER) # we need coeftest()
set.seed(1)  # set seed

rep <- 500  # number of calculated confidence intervals
n <- 100    # sample size of regressions
nrep <- 100 # first nrep confidence intervals of the simulated rep, which are plotted  

# initialize vectors of lower and upper interval boundaries
lower <- numeric(rep)
upper <- numeric(rep)

# loop of running regressions / estimation / CI
for(i in 1:rep) {
  # model
  x <- rnorm(n, mean=5, sd=5)
  u <- rnorm(n)
  y = 3 + x + u
  
  # regression
  ols <- lm(y ~ x)
  
  # CIs
  ctest <-  coeftest(ols)
  lower[i] <- ctest[2,1] - qt(1-0.05/2, n) * ctest[2,2]  
  upper[i] <- ctest[2,1] + qt(1-0.05/2, n) * ctest[2,2]
}
Code
# merge vectors of lower and upper confidence interval bounds into a matrix with two columns
CIs <- cbind(lower, upper)

# fraction of how many confidence intervals cover true value of 1
print( paste( mean( CIs[, 1] <= 1 & 1 <= CIs[, 2] )*100, "% of the simulated confidence intervals cover the true value of 1" ) )
[1] "94.4 % of the simulated confidence intervals cover the true value of 1"
Code
# identify intervals not covering true value of 1 for the first nreps cases
ID <- which( !(CIs[1:nrep, 1] <= 1 & 1 <= CIs[1:nrep, 2]) )
print( paste("The following CI does not cover the true value:", ID) )
[1] "The following CI does not cover the true value: 5" 
[2] "The following CI does not cover the true value: 14"
[3] "The following CI does not cover the true value: 23"
[4] "The following CI does not cover the true value: 27"
[5] "The following CI does not cover the true value: 66"
Code
# initialize the plot for first 100 CIs
plot(0, 
     xlim = c(1, nrep),
     ylim = c( 1 - sd(CIs[,1])*4.3, 1 + sd(CIs[,2])*4.3), 
     xlab = "Confidence intervals in repeated random samples", 
     ylab = expression(beta[1]~"="~1), 
     main = bquote("Simulating" ~ .(nrep) ~ "confidence intervals;" ~ 
                     .(length(ID)) ~ "out of" ~ .(nrep) ~ "do not cover the true value 1") )

# set up color vector
colors <- rep( "grey60", nrep) # color grey for all 100 CIs 
colors[ID] <- "salmon"         # color salmon for CIs not covering the true value

# draw reference line at beta1=1
abline(h = 1)

# add first nrep=100 confidence intervals with corresponding x,y coordinates 
for(j in 1:nrep) {
  lines(x = c(j,j), y = c(CIs[j, 1], CIs[j, 2]),  
        col = colors[j], 
        lwd = 3.6)
}

Figure 3.8: Simulated confidence intervals

3.7.2 Confidence Intervals – Example

library(AER);  library(wooldridge);  data("wage1")

# we once more estimate our wage model
summary(  out <- lm(lwage ~ educ + tenure + exper, data = wage1)  )
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

# Now we calculate the 95% confidence intervals for all parameters with the R function confint() 
confint(out)
                         2.5 %     97.5 %
      (Intercept) 0.0796755675 0.48904351
      educ        0.0776292151 0.10642876
      tenure      0.0159896854 0.02814475
      exper       0.0007356984 0.00750652

As we can see, 0.1 is contained in the 95% CI for educ. Therefore, a 10% effect of one more year of education cannot be ruled out at a significance level of 5%. This is the same result as we obtained earlier with the corresponding t-test

# Calculating the 99% confidence intervals for all parameters
confint(out, level = 0.99)
                          0.5 %      99.5 %
      (Intercept)  0.0149981868 0.553720895
      educ         0.0730790806 0.110978896
      tenure       0.0140692670 0.030065169
      exper       -0.0003340459 0.008576264

3.8 Testing multiple linear restrictions

To give a motivation for this section it is best to start with an example

  • Suppose we want to investigate whether the salaries of baseball players depend on their achievements

\begin{align*} log(salary) = \beta_0 &+ \beta_1\cdot years + \beta_2 \cdot gamesyr \\ &+ \beta_3 \cdot bavg + \beta_4\cdot hrunsyr + \beta_5\cdot rbisyr + u \end{align*}

  • years = number of years as an active player

  • gamesyr = number of games played per year

  • bavg = batting average

  • hrunsyr = homeruns per year

  • rbisyr = runs batted in per year

  • Data set mlb1 from the Wooldridge package, n = 353 major league baseball players

  • The last three variables are measures of performance. We want to test whether the salaries of the players depend on these, given the years as an active player and the number of games played per year

  • Our null hypothesis is H_0: \beta_3=0,\ \beta_4=0, \ \beta_5=0 against the H_1: at least one parameter ≠ 0

  • As we have to deal with multiple restrictions in this case, our usual t-test is not applicable any more


  • One possible general approach is as follows – Wald or F Tests

    • We first estimate the model without imposing the restrictions; this is the unrestricted version

    • After that, we estimate the model again, this time enforcing the restrictions; this is the restricted version (actually, this step is not really required for a Wald test, but it is easier to explain the test idea this way)

      • In the case of zero restrictions, like \beta_j=0, \, \ldots, (called exclusion restrictions), this is accomplished very easily by simply dropping the corresponding variables from the model

      • In the case of more complicated restrictions involving several parameters, like \beta_i - \beta_j=0 or \beta_i+\beta_j=1, a reparameterization of the model is necessary: For the first case we could define a new parameter \theta=\beta_i - \beta_j and plug in for \beta_i=\theta+\beta_j into the model and collect the variables accordingly. Afterwards, we drop the variable corresponding to \theta

  • Clearly, imposing restrictions on the model leads to a poorer fit. The test idea is that the H_0 has to be rejected if the fit of the model worsens too much. In this case, the data, along with the model of course, are apparently not compatible with the null hypothesis

  • Hence, we are looking for a test statistic which appropriately measures the decrease in the model’s fit caused by the imposition of the restrictions and from which we know the distribution


The F-statistic

  • The following test statistic does the job

F \ = \ \frac{\left(SSR_r-SSR_{ur} \right) / m}{SSR_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.8}

Thereby SSR \equiv \sum\nolimits _{i=1}^n \hat u_i^2 is defined for both the restricted and unrestricted model

  • This statistic measures the percentage decrease in the fit (the percentage increase in SSR), corrected for the degrees of freedom of the numerator (m, the number of restrictions) and of the denominator (n-k-1, the degrees of freedom used in calculating \hat \sigma_{ur}^2)

  • There is an equivalent version of this test statistic in terms of the R^2 instead of SSR. But this version is only applicable if the dependent variable is the same for both the restricted and unrestricted model

F \ = \ \frac{(R_{ur}^2 - R_{r}^2) / m}{(1-R_{ur}^2) / (n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.9}

  • This measures the decrease in the model’s fit in terms of a decrease in the R^2

  • Note, for both versions the numerator is always positive, as a (binding) restriction always reduces the model’s fit

  • One can show that the test statistic introduced above is F distributed with m and (n-k-1) degrees of freedom, as both the numerator and the denominator are (scaled) \chi^2 distributed random variables, which is the definition of an F-distribution. This test statistic is simply called F-statistic

  • Important Remark: The t-statistic introduced above is a special case of the F-statistic if we only test one restriction, i.e., m=1. Actually, in this case, the F-statistic is simply the square of the t-statistic. Therefore, any (two-tailed) t-test can also be formulated as an F-test
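
A brief check of this remark (a sketch assuming the wage model out from Section 3.7.2 is still in memory): testing the single exclusion restriction for tenure with lht() yields an F-statistic that equals the squared t-statistic

library(AER)   # provides lht()
tstat <- coef( summary(out) )["tenure", "t value"]
tstat^2                  # approx. 50.9
lht(out, "tenure = 0")   # the reported F-statistic matches tstat^2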


Wald statistic

  • As the denominator of the F-statistic is simply the estimate of \sigma_{ur}^2, which can be consistently estimated by OLS, there is an asymptotic version of the F-statistic, which we call W. Thereby, \hat \sigma_{ur}^2 in F is replaced by \sigma_{ur}^2 (assuming the number of observations is large enough so that the estimate for \sigma_{ur}^2 has already converged to the true value) and F is further multiplied by m

W \ = \ \frac{\left(SSR_r-SSR_{ur} \right)}{\sigma_{ur}^2} \ \sim \ \chi^2(m) \tag{3.10}

  • W, which is called the Wald statistic, is \chi^2 distributed because the denominator is no longer a random variable but simply a scaling constant, required so that W is actually \chi^2 distributed

  • Many econometric software packages let the user choose whether the F or \chi^2 version of the test should be used


3.8.1 Carrying out the test

    • Carrying out an F-test is basically the same procedure as with a t-test

    • We postulate a null hypothesis

    • We determine the critical value as the (1-\alpha) quantile, e.g., the 0.95 quantile, of the corresponding F or \chi^2 distribution

    • We calculate the test statistic and check whether it falls into the rejection region, i.e., whether the test statistic is larger than the critical value

    • Alternatively, we can calculate the p-value corresponding to the test statistic

  • The figures below illustrate this procedure

    • Note that the F and \chi^2 distributions are very similar in shape in this case because the denominator’s degrees of freedom are quite high (347). The numerator’s degrees of freedom are 3 – we test three restrictions

      • Both test statistics correspond to the baseball player example presented above. Further note the different scaling of the two distributions which amounts to 3

      • As shown after the figures, the F-statistic and W-statistic for the baseball player example are 9.55 and 28.65, respectively. Is the H_0 rejected?


Code
# Plot the F(3,100) Distribution on interval [-5,5]
df1 <- 3
df2 <- 353-6
end <- 5
a <- 0.05
hmax <- df( qf(0.3, df1, df2), df1, df2)  # not really hmax
t <- seq(0, 5, 0.01)
cv <- round( qf(1-a, df1, df2), digits = 2 )

plot(x = t, y = df(t, df1, df2),  type = "l",  col = "black",  lwd = 2, 
     main = bquote("Test with F(" ~ .(df1) ~ ", " ~ .(df2) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=0:end, pos = c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x = c(cv, seq(cv, 5, 0.01), 5),
        y = c( 0, df(seq(cv, 5, 0.01), df1, df2), 0 ), 
        col = 'cornflowerblue')




text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)),cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)

text(end*0.15, hmax/2.8, labels = bquote( qf[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08) 

Figure 3.9: F-distribution

Code
# Plot the ChiSq(3) Distribution on interval [-5,5]
df <- 3
a <- 0.05
end <- df*5
hmax <- dchisq( qchisq(0.3, df), df)  # not really hmax
t <- seq(0, end, 0.01)
cv <- round( qchisq(1-a, df), digits = 2 )

plot(x = t, y = dchisq(t, df),  type = "l",  col = "black",  lwd = 2, 
     main = bquote("Test with ChiSq(" ~ .(df) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=0:end, pos = c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x = c(cv, seq(cv, end, 0.01), end),
        y = c( 0, dchisq(seq(cv, end, 0.01), df), 0 ), 
        col = 'cornflowerblue')

text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)),cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)

text(end*0.15, hmax/2.8, labels = bquote( qchisq[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08) 

Figure 3.10: Chi-squared distribution

3.8.2 Baseball example continued

We estimate the baseball player model; unrestricted version
library(AER); library(wooldridge); data("mlb1")

unr <- lm( log(salary) ~  years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(unr)
      
      Call:
      lm(formula = log(salary) ~ years + gamesyr + bavg + hrunsyr + 
          rbisyr, data = mlb1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -3.02508 -0.45034 -0.04013  0.47014  2.68924 
      
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 1.119e+01  2.888e-01  38.752  < 2e-16 ***
      years       6.886e-02  1.211e-02   5.684 2.79e-08 ***
      gamesyr     1.255e-02  2.647e-03   4.742 3.09e-06 ***
      bavg        9.786e-04  1.104e-03   0.887    0.376    
      hrunsyr     1.443e-02  1.606e-02   0.899    0.369    
      rbisyr      1.077e-02  7.175e-03   1.500    0.134    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.7266 on 347 degrees of freedom
      Multiple R-squared:  0.6278,  Adjusted R-squared:  0.6224 
      F-statistic: 117.1 on 5 and 347 DF,  p-value: < 2.2e-16
  • Note that none of the performance variables is statistically significant

We estimate the baseball player model, this time in its restricted version
r <- lm( log(salary) ~  years + gamesyr, data = mlb1)
summary(r)
      
      Call:
      lm(formula = log(salary) ~ years + gamesyr, data = mlb1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.66858 -0.46412 -0.01177  0.49219  2.68829 
      
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 11.223804   0.108312 103.625  < 2e-16 ***
      years        0.071318   0.012505   5.703  2.5e-08 ***
      gamesyr      0.020174   0.001343  15.023  < 2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.7527 on 350 degrees of freedom
      Multiple R-squared:  0.5971,  Adjusted R-squared:  0.5948 
      F-statistic: 259.3 on 2 and 350 DF,  p-value: < 2.2e-16

We calculate the F-statistic from Equation 3.8
SSRur <- sum( unr$residuals^2 )
SSRr <- sum( r$residuals^2 )
df <- summary(unr)$df[2]

Fstat <- (SSRr - SSRur)/SSRur*df/3
Fstat
      [1] 9.550254
# 95% critical value
qf(0.95, 3, df)
      [1] 2.630641
# p-value
1 - pf(Fstat, 3, df)
      [1] 4.473708e-06
  • The H_0 that none of the performance variables has any impact on the salary of the players is strongly rejected, although the variables individually are not significant – a multicollinearity problem

  • Compare Figure 3.9


We carry out the same test with lht() (for linearHypothesis), contained in the AER package
  • Applying this procedure, we notice that it is actually not necessary to estimate the restricted model to carry out the test. We only need the unrestricted model, as was the case with the simple t-test
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"))
      Linear hypothesis test
      
      Hypothesis:
      bavg = 0
      hrunsyr = 0
      rbisyr = 0
      
      Model 1: restricted model
      Model 2: log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr
      
        Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
      1    350 198.31                                  
      2    347 183.19  3    15.125 9.5503 4.474e-06 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We get the very same results as with the F-test carried out by hand

We calculate the W-statistic from Equation 3.10; the chi squared version
sig2ur <- SSRur/df 

Wstat <-  (SSRr - SSRur)/sig2ur
Wstat
      [1] 28.65076
# 95% critical value
qchisq(0.95,3)
      [1] 7.814728
# p-value
1 - pchisq(Wstat, 3)
      [1] 2.651604e-06

The same test with lht(), contained in the AER package; chi squared version
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"), test = "Chisq")
      Linear hypothesis test
      
      Hypothesis:
      bavg = 0
      hrunsyr = 0
      rbisyr = 0
      
      Model 1: restricted model
      Model 2: log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr
      
        Res.Df    RSS Df Sum of Sq  Chisq Pr(>Chisq)    
      1    350 198.31                                   
      2    347 183.19  3    15.125 28.651  2.652e-06 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We get the same result as when doing the test by hand and nearly the same result as with the F-statistic version

  • Compare Figure 3.10


3.8.3 “the” F-statistic of a regression model

A special case of the F-test is a test of the overall significance of a regression with the

H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0

This null hypothesis states that the explanatory variables overall are not useful in explaining the dependent variable

  • This test of overall significance is routinely calculated and reported in most regression packages as “the” F-statistic; the null hypothesis is usually overwhelmingly rejected

  • If this H_0 cannot be rejected, the model is of no value in explaining the dependent variable
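
As a sketch (assuming the wage model out from Section 3.7.2 is still in memory), “the” F-statistic reported by summary() can be reproduced with lht() by jointly testing all slope coefficients against zero

library(AER)   # provides lht()
lht( out, c("educ = 0", "tenure = 0", "exper = 0") )   # F approx. 80.39, as reported by summary(out)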


3.8.4 Additional examples

As a last example in this chapter we will re-estimate our wage equation

log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u

  • We will test whether the effect of one additional year of education is three times as large as the effect of one additional year of tenure; H_0: \beta_1 = 3 \beta_3

    • As this hypothesis involves only one restriction, a t-test could be used. However, the restriction affects two parameters, therefore, a reparameterization of the model with a variable transformation would be necessary to implement this restriction

    • Using the R function lht() (for linearHypothesis) of the car and AER package implements this test very easily


library(AER);  library(wooldridge);  data("wage1")

# we first estimate the model
summary(  out <- lm(lwage ~ educ + tenure + exper, data = wage1)  )
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16
  • Note “the” F-statistic!

  • We test the H0: beta1 = 3 beta3
lht( out, "educ = 3*tenure") 
      Linear hypothesis test
      
      Hypothesis:
      educ - 3 tenure = 0
      
      Model 1: restricted model
      Model 2: lwage ~ educ + tenure + exper
      
        Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
      1    523 102.29                              
      2    522 101.46  1   0.83518 4.2971 0.03867 *
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • We additionally test whether beta1 = 0.1

  • If we have several restrictions, the restrictions must be put into a character vector with the ‘c()’ function

lht( out, c("educ = 3*tenure", "educ=0.1") )
      Linear hypothesis test
      
      Hypothesis:
      educ - 3 tenure = 0
      educ = 0.1
      
      Model 1: restricted model
      Model 2: lwage ~ educ + tenure + exper
      
        Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
      1    524 104.47                                  
      2    522 101.46  2    3.0199 7.7688 0.0004735 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Imposing these two restrictions together is clearly not compatible with the data anymore

3.9 Tests in matrix notation

Wald test statistic with general linear hypothesis in matrix notation

Every set of linear hypotheses can be stated as:

\mathbf R \boldsymbol{\hat{\beta}} = \mathbf c \tag{3.11}

Thereby, the restriction matrix \mathbf R has dimensions m x (k+1), with m being the number of restrictions. Consider the following example:

\underbrace{\left[\begin{array}{cccc} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right]}_{R}\left[\begin{array}{c} \hat{\beta}_{0} \\ \hat{\beta}_{1} \\ \hat{\beta}_{2} \\ \hat{\beta}_{3} \end{array}\right] = \underbrace{\left[\begin{array}{c} 0 \\ 0 \end{array}\right]}_{c}

Multiplying out we see that, in this case, we test whether \hat \beta_2=0 and \hat \beta_3=0.

We have the following important Theorem:

Theorem 3.3 (Distribution of a quadratic form) Let \mathbf z be a N(\mathbf 0, \mathbf V) normally distributed random vector with h elements. Then the quadratic form

\mathbf z' \mathbf V^{-1} \mathbf z

is distributed as

\mathbf z' \mathbf V^{-1} \mathbf z \ \sim \ \chi^2(h) (see Appendix A.6 and A.9).


  • Assuming that \sigma^2 is a known constant and applying the covariance formula from Equation C.7 to the restriction vector (R \boldsymbol{\hat{\beta}} - \mathbf c), we get the covariance matrix of this random vector:

\sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R' \tag{3.12}

  • As \boldsymbol{\hat \beta} is asymptotically normally distributed (or normally distributed in small samples if \mathbf u is normally distributed), provided H_0 is true, the restriction vector is

(R \boldsymbol{\hat{\beta}} - \mathbf c) \ \sim \ N(\mathbf 0, \sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R')

  • Applying Theorem 3.3 from above we conclude that the following must hold:

W \ = \ \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})^{\prime}}_{z^{\prime}} \underbrace{\left(\sigma^{2} \mathbf{R}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{R}^{\prime}\right)^{-1}}_{V^{-1}} \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})}_{z} \ \sim \ \chi^{2}(m) \tag{3.13}

This Wald test statistic is a weighted sum (by the inverse of the covariance matrix) of the distance between the hypothesized values \mathbf c and the actual estimation outcomes \mathbf R \boldsymbol{\hat \beta}. If this normed distance is too large, H_0 is rejected.

For m = 1 (only one restriction), \sqrt W reduces to a simple t-statistic.


If \sigma^2 is unknown, it has to be estimated by \hat \sigma^2={\sum_{i=1}^n \hat u_i^2} \ / \ {(n-k-1)}.

  • With the help of the above theorem, it can be shown that this estimator (after scaling with \sigma^2) is distributed as \chi^2(n-k-1)

  • Hence, in this case, the Wald statistic in Equation 3.13 is the ratio of two independently \chi^2 distributed random variables and therefore is F-distributed after appropriate scaling (see Appendix A.7):

F \ = \ \frac{(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)' \left(\hat\sigma^{2} \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'\right)^{-1}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)}{m} \ \sim \ F(m,n-k-1) \tag{3.14}

Note, in small samples this formula is valid only if \mathbf u is normally distributed. In large samples, Equation 3.14 times m converges to Equation 3.13 and so the two test statistics are asymptotically equivalent.

  • Furthermore, it can be shown that the formula above is equivalent to:

F \ = \ \frac{\left(\hat{\mathbf{u}}_{r}^{\prime} \hat{\mathbf{u}}_{r}-\hat{\mathbf{u}}_{ur}' \hat{\mathbf{u}}_{ur} \right) / m}{\hat{\mathbf{u}}'_{ur} \hat{\mathbf{u}}_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.15}

The interpretation of this test-statistic is straightforward: If the percentage decrease of the fit (measured by SSR) of the restricted model is too large compared to the model without imposing the restrictions, the restricted model is rejected.
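
As a closing sketch (assuming the unrestricted baseball model unr from Section 3.8.2 is still in memory), the Wald statistic in its matrix form can be computed directly; since vcov() uses the OLS estimate \hat \sigma_{ur}^2, the quadratic form reproduces the value of about 28.65 obtained earlier

R <- matrix(0, nrow = 3, ncol = 6)           # m = 3 restrictions, k+1 = 6 coefficients
R[1, 4] <- 1;  R[2, 5] <- 1;  R[3, 6] <- 1   # pick out bavg, hrunsyr and rbisyr
c0 <- rep(0, 3)                              # hypothesized values

b <- coef(unr)
W <- drop( t(R %*% b - c0) %*% solve( R %*% vcov(unr) %*% t(R) ) %*% (R %*% b - c0) )
W                  # approx. 28.65, the Wald statistic from Section 3.8.2
1 - pchisq(W, 3)   # corresponding p-value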