Mainly based on Wooldridge (2019), Chapter 4

3.1 Topics of this chapter

This chapter deals with statistical inference in the regression model, particularly with

  • Hypothesis tests about population parameters

  • Construction of confidence intervals

For both we need the distribution of the OLS estimators for repeated samples

  • The OLS estimators \hat \beta_j and \hat \sigma^2 depend on the particular sample and are therefore random variables

  • We already know their expected values and their variances

  • However, for hypothesis tests we need to know their distribution

  • In order to derive their distribution we need additional assumptions, at least for small samples

    • MLR.6: Errors are normally distributed, u_i \ \sim \ N(0, \sigma^2), which implies that the dependent variable y is normally distributed as well, conditional on the values of the explanatory variables in \mathbf x

y_i \, | \, \mathbf x \ \sim \ N(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \, \sigma^2)


3.2 Statistical inference in general

Discussion of the normality assumption

  • The error term u_i is interpreted as the sum of many different unobserved factors

  • According to the Central Limit Theorem (CLT, see Theorem A.3), sums (means) of many independent factors tend to be normally distributed

  • Problems:

    • Are the conditions for the CLT met? In particular:

      • How many different factors? Is the number large enough?

      • Possibly very heterogeneous distributions of individual factors

      • How independent are the different factors from each other?

The normality of the error term is an empirical question

  • At least the error distribution should be “close” to normal

  • In many cases, normality is questionable or even impossible by definition

    • Examples where normality cannot strictly hold:

      • Wages (non-negative; also: minimum wage)

      • Number of arrests (takes on a small number of integer values)

      • Employed/not employed (an indicator or binary dependent variable, which takes on only 0 or 1)

  • In some cases, normality can be achieved through transformations of the dependent variable (e.g., use log(wage) instead of wage; see the sketch below)
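
The following quick visual check is only a sketch, using the wage1 data from the wooldridge package that is also used later in this chapter: raw wages are heavily right-skewed, while log(wage) is much closer to a symmetric, bell-shaped distribution (strictly, MLR.6 concerns the error term, but the shape of the dependent variable already suggests why the log transformation helps)

library(wooldridge);  data("wage1")

# side-by-side histograms: raw wages versus log wages
par(mfrow = c(1, 2))
hist(wage1$wage,  breaks = 30, main = "wage",      xlab = "")
hist(wage1$lwage, breaks = 30, main = "log(wage)", xlab = "")
par(mfrow = c(1, 1))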

Very important: Under certain regularity conditions regarding the data and u_i, the assumption of normality can be replaced by a large sample size. In that case, the error term need not be normally distributed.


The following terminology is often used:

\underbrace{\text{MLR.1 - MLR.5}}_{\text{Gauss-Markov assumptions}} \quad \quad \underbrace{\text{MLR.1 - MLR.6}}_{\text{Classical linear model (CLM) assumptions}}


Furthermore, we have the following important theorem:

Theorem 3.1 (Normal sampling distributions) Under assumptions MLR.1 - MLR.6

\hat \beta_j \ \sim \ N \left( \beta_j, \operatorname{Var}(\hat \beta_j) \right) \tag{3.1}

That means that, given the assumptions above, the OLS estimator \hat \beta_j is unbiased and normally distributed around the true parameter with a variance that was derived earlier.

Additionally:

\dfrac {\hat \beta_j - \beta_j}{sd(\hat \beta_j)} \ \sim \ N(0,1) \tag{3.2}

This means that the standardized estimator follows a standard normal distribution
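
A small Monte Carlo sketch (with made-up parameter values) illustrates this: we repeatedly draw samples from a simple regression model with normal errors, standardize the slope estimate with its true standard deviation, and compare the resulting distribution with N(0,1)

set.seed(123)
n <- 50;  reps <- 2000
z <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n);  u <- rnorm(n, sd = 2)   # normal errors with sigma = 2
  y <- 1 + 0.5*x + u                      # true beta_0 = 1, beta_1 = 0.5
  b1 <- coef( lm(y ~ x) )[2]
  # standardize with the true sd(beta_1_hat) = sigma / sqrt(SST_x)
  z[r] <- (b1 - 0.5) / ( 2 / sqrt( sum( (x - mean(x))^2 ) ) )
}
hist(z, breaks = 40, freq = FALSE, main = "Standardized slope estimates and the N(0,1) density")
curve(dnorm(x), add = TRUE, lwd = 2)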


The proof is quite straightforward if we use matrix notation, see Section 2.4.1

The model in matrix notation is:

\mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

And the OLS estimator for \boldsymbol{\hat \beta} is (see Equation C.2):

\hat {\boldsymbol \beta} = (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf y

This can be written as (Equation C.3):

\hat {\boldsymbol \beta} = \boldsymbol \beta + \underbrace {(\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf u}_{\mathbf {Au}} \tag{3.9}

  • The typical element of the vector \mathbf {Au} is \sum_{j=1}^n a_{ij} \, u_j

    • Hence, the OLS estimator \hat {\boldsymbol \beta} is a linear combination of the normally distributed errors u_{j}

    • According to the Reproductive Property of the normal distribution, \hat {\boldsymbol \beta} therefore is normally distributed as well:

\hat {\boldsymbol \beta} \, | \, \mathbf X \ \sim \ N \left( \boldsymbol \beta, \ \sigma^2 (\mathbf X' \mathbf X)^{-1} \right)

  • Here, we used the fact that under MLR.1 - MLR.4, \hat {\boldsymbol \beta} is unbiased

  • For the derivation of the covariance matrix of \hat {\boldsymbol \beta} we used MLR.5 and formula Equation C.7
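
As a quick check of this derivation (a sketch using the wage1 data that appears later in this chapter), we can compute (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf y by hand and compare it with the coefficients returned by lm()

library(wooldridge);  data("wage1")

X <- cbind(1, wage1$educ, wage1$tenure, wage1$exper)     # design matrix including the intercept
y <- wage1$lwage

solve( t(X) %*% X ) %*% t(X) %*% y                       # (X'X)^{-1} X'y
coef( lm(lwage ~ educ + tenure + exper, data = wage1) )  # the same coefficients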


3.3 Testing hypotheses about a single population parameter

We have the following theorem:

Theorem 3.2 (t-distribution for the standardized estimators) Given MLR.1 - MLR.6, we state:

\dfrac {\hat \beta_j - \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.3}

  • That means that if the standardization in Equation 3.2 is done using the estimated standard deviation of \hat \beta_j, the normal distribution from Equation 3.2 is replaced by a t-distribution

    • The estimated standard deviation of \hat \beta_j is called standard error

    • The standard error, se(\hat \beta_j), is calculated with \hat \sigma^2 instead of the true \sigma^2 in Equation 2.37. Therefore, se(\hat \beta_j) is itself a random variable; the underlying estimator \hat \sigma^2 is (after scaling with \sigma^2) \chi^2 distributed

  • In this case, we actually have a ratio of two random variables in Equation 3.3 – a standard normal over the square root of an (independent) \chi^2 distributed random variable divided by its degrees of freedom – which leads to a t-distribution. That’s our heuristic proof of Theorem 3.2

  • A very important special case arises if we test the particular null hypothesis H_0 : \beta_j=0

    • This means that we hypothesize that the true population parameter is equal to zero, i.e. there is no effect of x_j on y after controlling for the other independent variables

    • This is the so-called significance test. In this case, presuming the H_0 is true, Equation 3.3 reduces to the t-statistic

t_{\hat \beta_j} \ \equiv \ \dfrac {\hat \beta_j - 0}{se(\hat \beta_j)} \ = \ \dfrac {\hat \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.4}

  • If we, more generally, hypothesize that \beta_j = a_j, the test statistic is (presuming the H_0 is true)

\dfrac {\hat \beta_j-a_j} {se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.5}

Generally, the t-statistic is a standardized measure of the distance between the estimated parameter and its hypothesized value (which is zero or a_j)

In particular, it measures how many estimated standard deviations the estimated coefficient is away from the H_0 (0 or a_j)

  • The basic test idea is the following: The farther the estimated coefficient is away from the hypothesized value zero (or a_j), the less likely it is that the null hypothesis holds true

    • But what does “far away” from zero (or a_j) mean? This is measured by the t-statistic above and assessed with the help of the distribution of the t-statistic

    • If the H_0 is true, we know from Theorem 3.2 that the distribution of the t-statistic is the t-distribution (so the name of the t-statistic is no accident)

      • If the degrees of freedom of the t-distribution are greater than 60 (many observations), the t-distribution is practically identical to the standard normal N(0,1) distribution
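
A quick numerical check of this claim: the 0.975 quantile of the t-distribution approaches the corresponding N(0,1) quantile of about 1.96 as the degrees of freedom grow

dof <- c(5, 30, 60, 120, 500)
rbind(dof    = dof,
      q_t    = round( qt(0.975, dof), 3 ),                    # 0.975 quantile of t(dof)
      q_norm = rep( round( qnorm(0.975), 3 ), length(dof) ))  # 0.975 quantile of N(0,1)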

3.3.1 Testing procedure for a single parameter

How is a test procedure actually carried out?

  1. From Theorem 3.2, we know the distribution of the test statistic

  2. We carry out the estimation of \beta_j, obtaining \hat \beta_j and se(\hat \beta_j), and calculate the test statistic from Equation 3.5

  3. We want to test a hypothesis H_0 about the true value of the estimated parameter – actually we want to reject the H_0

  4. The test statistic shows how far away the estimate \hat \beta_j is from the hypothesized value

  5. To assess whether the test statistic is far enough away to reject the H_0, we construct a rejection region so that, if the null hypothesis is true, the test statistic falls into this rejection region only accidentally, for instance in only 5% of the cases. This small probability is freely chosen and is called the significance level

    1. The significance level \alpha is the probability of making a so-called type I error or \alpha-error, which is the probability of falsely rejecting a true null hypothesis (only due to sampling errors). In most cases \alpha=0.05 (5%) is chosen, but \alpha=0.01 (1%) or \alpha=0.1 (10%) are also quite common

    2. The boundary points of the rejection region are called critical values. These are the values that the test statistic exceeds (or falls below) only with a probability equal to the significance level if the H_0 is true. Formally, we are looking for the corresponding quantiles of the test statistic’s distribution

  6. Having constructed the rejection region of the test, we reject the null hypothesis H_0: \beta_j=a_j in favor of the alternative hypothesis if the test statistic falls into this rejection region. In this case the estimated value is “too far away” from the hypothesized value

  7. There are two types of alternative hypotheses H_1:

    1. H_1: \beta_j \neq a_j for a two sided or two-tailed test

    2. either H_1: \beta_j>a_j or H_1: \beta_j<a_j for a one sided or one-tailed test

    3. For one-sided tests, it is sometimes not obvious whether one should apply a left-sided or a right-sided test. There is a simple rule: if the t-statistic is negative, we can only apply a left-sided test, and if the t-statistic is positive, we can only apply a right-sided test; otherwise, the H_0 could never be rejected

  8. Two-tailed tests are much more common. One-tailed tests are used if the researcher is only interested in a particular “direction” of the parameter, e.g., with a unit-root test (a particular test for the stationarity of time series) one is only interested in whether the parameter of interest might be \geq 0


Code
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
cvl <- round( qnorm(0.95), digits = 2) # critical value as quantile of the norm. distrib.

plot(x = t, y = dnorm(t),  type = "l",  col = "black",  lwd = 2, 
     main = expression("One-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "≤ 0, H"[1]: ~ beta ~" > 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 


# critical region in right tail
polygon(x = c(cvl, seq(cvl, 5, 0.01), 5),
        y = c( 0, dnorm(seq(cvl, 5, 0.01)), 0 ), 
        col = 'cornflowerblue')


# text
text(3.5, 0.15, labels = expression( "Rejection region," ~ alpha ~ "= 0.05"),cex = 1)
arrows(3.6, 0.133, 2.3, 0.04, length = 0.08)

text(0, 0.1, labels = bquote( qn[0.95] ~ " = " ~ .(cvl)), cex = 1)
arrows(0, 0.085, cvl-0.03, 0.002, length = 0.08)

text(-3.5, 0.3, labels = expression("Distribution of the test statistic"),cex = 1)
arrows(-3.6, 0.28, -1.22, 0.2, length = 0.08)

Figure 3.1: Distribution of the t-statistic - in this case a normal distribution

Code
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qnorm(0.025), digits = 2 ); cvr <- round( qnorm(0.975), digits = 2)

plot(x = t, y = dnorm(t),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dnorm(seq(-5, cvl, 0.01)), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dnorm(seq(cvr, 5, 0.01)), 0 ), 
        col = 'cornflowerblue')


text(-4, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(-4, 0.13, -2.5, 0.026, length = 0.08)

text(4, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(4, 0.13, 2.5, 0.026, length = 0.08)

text(-0.5, 0.15, labels = bquote( qn[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-0.5, 0.135, cvl+0.03, 0.002, length = 0.08)

text(0.5, 0.1, labels = bquote( qn[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.5, 0.085, cvr-0.03, 0.002, length = 0.08) 

Figure 3.2: Distribution of the t-statistic - in this case a normal distribution

Code
# Plot the t(4) Distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qt(0.025, 4), digits = 2 ); cvr <- round( qt(0.975, 4), digits = 2)

plot(x = t, y = dt(t, 4),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.05; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), 
        col = 'cornflowerblue')


text(-4.5, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(-4.5, 0.13, -3.8, 0.02, length = 0.08)

text(4.5, 0.15, labels = expression("0.025"~"="~over(alpha, 2)),cex = 1)
arrows(4.5, 0.13, 3.8, 0.02, length = 0.08)

text(-2.2, 0.18, labels = bquote( qt[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl+0.03, 0.0015, length = 0.08)

text(0.55, 0.09, labels = bquote( qt[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr-0.03, 0.002, length = 0.08) 

Figure 3.3: Distribution of the t-statistic - in this case a t-distribution
Code
# Plot the t(4) Distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)

# lower and upper critical values
cvl <- round( qt(0.01, 4), digits = 2 ); cvr <- round( qt(0.99, 4), digits = 2)

plot(x = t, y = dt(t, 4),  type = "l",  col = "black",  lwd = 2, 
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.02; e.g., H"[0]: ~ beta ~ "= 0, H"[1]: ~ beta ~" ≠ 0"),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=-5:5, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), 
        col = 'cornflowerblue')


text(-4.5, 0.15, labels = expression("0.01"~"="~over(alpha, 2)),cex = 1)
arrows(-4.5, 0.13, -4.1, 0.015, length = 0.08)

text(4.5, 0.15, labels = expression("0.01"~"="~over(alpha, 2)),cex = 1)
arrows(4.5, 0.13, 4.1, 0.015, length = 0.08)

text(-2.2, 0.18, labels = bquote( qt[0.01] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl+0.03, 0.0015, length = 0.08)

text(0.55, 0.09, labels = bquote( qt[0.99] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr-0.03, 0.002, length = 0.08) 

Figure 3.4: Distribution of the t-statistic - in this case a t-distribution

3.3.2 Example for tests - wage equation

We first estimate the model – in this case a model explaining individuals’ wages
library(AER);  library(wooldridge);  data("wage1")

output <- summary( lm(lwage ~ educ + tenure + exper, data = wage1)  )
output
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

We want to test whether tenure has any effect on lwage
  • Significance Test: H0: b3 = 0, H1: b3 ≠ 0, two-tailed

  • Choosing the significance level alpha = 5%

alpha <- 0.05
# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# "the" t-statistic:  b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
      [1] 7.13307
# the 2.5% and 97.5% quantile of the standardized normal distribution: rejection region in both tails

# Rule of Thumb for two-tailed tests: reject the H0 if |t-statistic| ≥ 2
c( qnorm(alpha/2), qnorm(1-alpha/2) )
      [1] -1.959964  1.959964

The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.

The critical values are about \pm2.

Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small probability of a type I error, of course.


We want to test whether tenure really has a positive effect on lwage
  • One sided test: H0: b3 ≤ 0, H1: b3 > 0 – right-tailed

  • We choose a significance level alpha = 5%

alpha <- 0.05

# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# the t-statistic: b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
      [1] 7.13307
# the 95% quantile of the standardized normal distribution: rejection region only in the right tail
qnorm(1-alpha)
      [1] 1.644854

The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.

The critical value is about 1.64.

Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no or a negative effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small probability of a type I error, of course.


We want to test whether educ has an effect on lwage of at least 10%:
  • H0: b2 ≥ 0.1, H1: b2 < 0.1 – left-tailed (because estimated coefficient is less than 0.1)

  • Choosing the significance level alpha = 5%

alpha <- 0.05

# printing the coefficients, standard errors and t-statistic
coef(output)
                     Estimate  Std. Error   t value     Pr(>|t|)
      (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
      educ        0.092028988 0.007329923 12.555246 8.824197e-32
      tenure      0.022067218 0.003093649  7.133070 3.294407e-12
      exper       0.004121109 0.001723277  2.391437 1.713562e-02
# the t-statistic: (b2 - 0.1) / sd(b2) 
( coef(output)[2,1] - 0.1 ) / coef(output)[2,2]
      [1] -1.087462
# the 5% quantile of the standardized normal distribution: rejection region in the left tail
qnorm(alpha)
      [1] -1.644854

The t-statistic does not fall into the rejection region.

Therefore, H_0 is not rejected – there is not enough evidence in favor of H_1

However, H_1 (b2 < 0.1) might be true anyway – we might be committing a type II error


3.4 Types of errors and power of statistical tests

  • There are principally two types of errors in statistical tests

    • Type I error or \alpha-error

      • This is the probability of falsely rejecting a true H_0 due to sampling errors. This probability is chosen by the researcher via the significance level. So, why not choose a very low significance level \alpha to minimize this error? In that case, rejecting the H_0 would give us more confidence in the rejection. But the downside is that a low significance level increases another type of error
    • Type II error or \beta-error

      • This is the probability of accepting a false H_0. Looking at Figure 3.3 and Figure 3.4 it is pretty clear that reducing the significance level, for instance from \alpha= 0.05 to 0.02, shrinks the rejection region and therefore enlarges the acceptance region. In this case, the probability of rejecting the H_0 in repeated samples is decreased even if the H_0 is false
    • The power of a test is the probability of rejecting a false H_0. Hence, this is one minus the probability of making a type II error and one of the most important and desirable properties of a test – it is the ability of the test to discriminate between true and false hypotheses

      • Besides the significance level and the standardized distance of the hypothesized value from the estimated one, all factors which reduce the standard errors of the estimates increase the power of a test; most notably a large number of observations and low multicollinearity, i.e., large variation of the explanatory variables (a small simulation sketch follows below)
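
The following small simulation is only a sketch with made-up values: it estimates the power of a 5% significance test of the (false) H_0: \beta_1=0 when the true slope is 0.2, for several sample sizes

set.seed(42)
power_sim <- function(n, beta1 = 0.2, reps = 1000, alpha = 0.05) {
  rej <- replicate(reps, {
    x <- rnorm(n);  u <- rnorm(n)
    y <- 0.5 + beta1*x + u
    summary( lm(y ~ x) )$coefficients[2, 4] < alpha   # TRUE if H0: beta1 = 0 is rejected
  })
  mean(rej)   # share of rejections = estimated power
}
sapply(c(25, 50, 100, 200), power_sim)   # power increases with the number of observations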

3.5 p-value

The classical test strategy of

  1. choosing a significance level

  2. calculating the test statistic

  3. looking whether the test statistic is within the rejection region
    (i.e. whether |test statistic| ≥ |critical value|)

is somewhat arbitrary as the result of the test depends on the chosen significance level

  • As an example, suppose the observed t-statistic for a significance test, i.e. H_0: \beta_j=0 versus H_1: \beta_j≠0, equals 1.85

    • Using a significance level of 0.05, the H_0 cannot be rejected – the corresponding critical values, q_{0.025}, q_{0.975}, are about ±2

    • However, using a significance level of 0.1, the H_0 can be rejected – the corresponding critical values, q_{0.05}, q_{0.95}, are about ±1.65

    • So, what conclusions should be drawn?


  • The concept of the p-value provides an alternative to this and answers the following question: given the observed t-statistic, what is the smallest significance level at which the H_0 would be rejected?

  • This has the following interpretation:

    • The p-value is the probability of observing a t-statistic as large as the one we actually did (or even larger in absolute terms) if the H_0 is true

    • Hence, a small p-value is evidence against the H_0, as the actually observed (large) t-statistic is then very unlikely to be merely a result of sampling errors, and vice versa

    • Alternatively: A small p-value is evidence against the null hypothesis because one can reject the null hypothesis even at small significance levels (small error probability)

  • To calculate a p-value by hand is seldom necessary (the econometric package usually does the work). It could be done the following way, which, in a sense, is a reversal of the classical test strategy

    • Construct the rejection region, using ± the observed t-statistic as critical values

    • Calculate the implied significance level. This is done with the cdf() (cumulative distribution function). For our example, the implied significance level is: \operatorname {cdf}(-1.85) + (1 - \operatorname {cdf}(1.85))=0.072.
      The R code for an assumed t_{40} distribution is (1 - pt(abs(1.85), 40)) * 2 = 0.072 (see the short snippet after this list)

  • The following Figure 3.5 illustrates the concept of the p-value
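
As a standalone check (a sketch reproducing the number given above), the two-sided p-value for an observed t-statistic of 1.85 with 40 degrees of freedom can be computed directly with pt()

tstat <- 1.85
dof   <- 40
2 * (1 - pt(abs(tstat), dof))   # approx. 0.072; equivalently: 2 * pt(-abs(tstat), dof)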


Code
# Plot of prob.value

b=0    # beta hat
sd=1  # se(beta hat)
t=1.85
df=40
a=round(  (1-pt(abs(t), df))*2, digits = 3 ) # prob value
#a=0.036   # significance level#

beg=round(-4)
end=round(4)
maxh=dnorm(0)

x <- seq(beg, end, 0.01)
cvl <- -t
cvr <- t

plot(x = x, y = dt(x, df),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( "Observed t-statistic = 1.85," ~  t[40] ~ "distrib. => p-Value =  [1 - pt(| t |, 40)]*2 = " ~ .(round( a, digits=3))),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl, 0.01), cvl),
        y = c( 0, dt(seq(beg, cvl, 0.01), df), 0 ),
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr, seq(cvr, end, 0.01), end),
        y = c( 0, dt(seq(cvr, end, 0.01), df), 0 ),
        col = 'cornflowerblue')



# left
text(beg+1, maxh*0.3, labels = bquote( over(alpha,2) ~ " =  1 - pt(| t |, 40)  = " ~ .(a/2) ), cex = 1)
arrows(beg+1, maxh*0.26, cvl*1.18  , maxh*0.045, length = 0.08)
# right
text(end-1, maxh*0.3, labels = bquote( over(alpha,2) ~ " =  1 - pt(| t |, 40) = " ~ .(a/2) ), cex = 1)
arrows(end-1, maxh*0.26, cvr*1.18, maxh*0.045, length = 0.08)
# acceptance 
text(end-1, maxh*0.85, labels = bquote( "(1 -" ~ alpha ~ ") = " ~ .(1-a) ), cex = 1)
arrows(end-1, maxh*0.81, end*0.1, maxh*0.6, length = 0.08)


# left
text( -0.8, maxh*0.35, labels = bquote( .(-t)), cex = 1)
arrows(-0.8, maxh*0.31, -t, maxh*0.003, length = 0.08)
# right
text(0.8, maxh*0.35, labels = bquote( .(t)), cex = 1)
arrows(0.8, maxh*0.31, t, maxh*0.003, length = 0.08)

text(-2.8, 0.3, labels = bquote("Distribution of the test statistic," ~ t[40]),cex = 1)
arrows(-2.8, 0.28, -1.12, 0.22, length = 0.08)

Figure 3.5: Distribution of the t-statistic

3.6 General remarks on parameter testing

Guidelines for discussing economic and statistical significance

  • All statistical tests described in this chapter require at least MLR.1 - MLR.5 (for small samples, we additionally need MLR.6). Violations of some assumptions need special treatments

  • If a variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its economic or practical importance

  • The fact that a coefficient is statistically significant does not necessarily mean it is economically or practically significant!

    • This is especially true for data sets with very many observations. For instance, if you have thousands of observations in a cross section data set, nearly every effect is statistically significant, even the most trivial ones
  • If a variable is statistically and economically important but has the “wrong” sign, the regression model might be misspecified

  • If a variable is statistically insignificant at the usual levels (10%, 5%, or 1%), one may think of dropping it from the regression

    • However, if the sample size is small, effects might be imprecisely estimated so that the case for dropping insignificant variables is less strong

3.7 Confidence intervals

Until now, we have dealt with point estimates of the unknown parameters. But according to Theorem 3.1, we know that

{\hat \beta_j} \ \sim \ N\left(\beta_j, \, sd(\hat \beta_j)^2\right)

Now, imagine we want to construct an interval estimate of the true parameter \beta_j which covers the true parameter with a probability of, say, 95% (provided our model is correctly specified, of course)

  • To do so, we need the 0.025 and 0.975 quantiles of this distribution. It turns out that these quantiles are simply the ones of the standardized normal distribution – or of the t-distribution, if \sigma^2 is unknown and has to be estimated with only a few observations – times the estimated standard deviation of \hat \beta_j, se(\hat \beta_j)

  • This argument leads to the so-called 95% confidence interval, which covers the true parameter with a probability of 95%. As the distribution at hand is symmetric, we have q_{0.025}=-q_{0.975} and therefore

\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ q_{0.975} \times se(\hat \beta_j) \right] \tag{3.6}

This way, we heuristically constructed an interval estimate for \beta_j, considering sample variability. Thereby, q_{0.975} denotes the 0.975 quantile of a standardized normal or a t_{n-k-1} distribution (for fewer than 60 degrees of freedom)


  • As the 0.975 quantile is about 2 in most cases (except for very low degrees of freedom), we can approximate the 95% confidence interval by a very simple formula, which is often used in practice

\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ 2 \times se(\hat \beta_j) \right] \tag{3.7}

  • For a 99% confidence interval, use 2.6, and for a 90% confidence interval use 1.65 instead of 2. These approximations work very well for more than 60 degrees of freedom

  • For a more formal derivation of confidence intervals, we start with Theorem 3.2 which implies:

P\left(-q_{0.975} \ < \ \dfrac{\hat{\beta}_{j}- \beta_j }{ se(\hat \beta_j)} \ < \ q_{0.975}\right)=0.95 \ \ \Rightarrow\left[\times se(\hat \beta_j)\right] \\ P\left(-q_{0.975} \times se(\hat \beta_j) \ < \ \hat{\beta}_{j}- \beta_j \ < \ q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow\left[-\hat{\beta}_{j}\right] \\ P\left(-q_{0.975} \times se(\hat \beta_j)-\hat{\beta}_{j} \ < \ -\beta_j \ < \ -\hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow[ \times -1] \\ P\left(q_{0.975} \times se(\hat \beta_j)+\hat{\beta}_{j} \ > \ \beta_j > \ \hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow

P\left(\hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j) \ < \ \beta_{j} \ < \ \hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right) = 0.95
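
As a sketch (assuming the summary object output of the wage model from Section 3.3.2 is still in memory), the 95% confidence interval for the tenure coefficient can be computed by hand and compared with the confint() results shown in Section 3.7.2

q  <- qt(0.975, df = 522)    # 0.975 quantile of the t-distribution with n-k-1 = 522 df
b  <- coef(output)[3, 1]     # estimated coefficient of tenure
se <- coef(output)[3, 2]     # its standard error
c(b - q*se, b + q*se)        # approx. [0.016, 0.028], cf. confint() in Section 3.7.2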


Code
# Plot of 95% confidence interval

b=2.3    # beta hat
sd=0.96  # se(beta hat)
a=0.05   # significance level#

beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)

x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits = 3 )
cvr <- round( qnorm(1-a/2, 0, 1), digits = 3)

plot(x = x, y = dnorm(x, b, sd),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( .((1-a)*100) ~ "% Confidence Interval; e.g.," ~  hat(beta) ~ "=" ~ .(b) ~ ", " ~ se(hat(beta)) ~ "=" ~ .(sd)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y = c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y = c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')


# confidence region
polygon(x = c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y = c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), 
        col = 'gray99')


# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col = "gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize  = 0.08, lwd = 6, col = "darkgray")
rug( b, ticksize  = 0.05, lwd = 1.5)


# left
text(beg, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18  , maxh*0.02, length = 0.08)
# right
text(end, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length = 0.08)

text(end, maxh*0.85, labels = bquote( .(1-a) ), cex = 1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length = 0.08)



# left
text( beg+(b-beg)/4, maxh*0.45, labels = bquote( hat(beta) ~ "-" ~.(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b-abs(cvl)*sd)), cex = 1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length = 0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels = bquote( hat(beta) ~ "+" ~ .(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b+cvr*sd)), cex = 1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length = 0.08)


# beta hat
text(b-0.1, maxh*0.2, labels = bquote(hat(beta) ~ "=" ~ .(b)), cex = 1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length = 0.08) 

text(-1.0, 0.35, labels = expression("Estimated distribution of" ~ hat(beta)),cex = 1)
arrows(-1.1, 0.33, 1.11, 0.2, length = 0.08)

Figure 3.6: Confidence Interval

Code
# Plot of 99% confidence interval

b=2.3    # beta hat
sd=0.96  # se(beta hat)
a=0.01   # significance level#

beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)

x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits = 3)
cvr <- round( qnorm(1-a/2, 0, 1), digits = 3)

plot(x = x, y = dnorm(x, b, sd),  type = "l",  col = "black",  lwd = 2, 
     main = bquote( .((1-a)*100) ~ "% Confidence Interval; e.g.," ~  hat(beta) ~ "=" ~ .(b) ~ ", " ~ se(hat(beta)) ~ "=" ~ .(sd)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=beg:end, pos = c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x = c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y = c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')

# critical region in right tail
polygon(x = c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y = c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), 
        col = 'cornflowerblue')


# confidence region
polygon(x = c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y = c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), 
        col = 'gray99')


# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col = "gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize  = 0.08, lwd = 6, col = "darkgray")
rug( b, ticksize  = 0.05, lwd = 1.5)


# left
text(beg, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18  , maxh*0.02, length = 0.08)
# right
text(end, maxh*0.3, labels = bquote( .(a/2) ), cex = 1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length = 0.08)

text(end, maxh*0.85, labels = bquote( .(1-a) ), cex = 1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length = 0.08)



# left
text( beg+(b-beg)/4, maxh*0.45, labels = bquote( hat(beta) ~ "-" ~.(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b-abs(cvl)*sd)), cex = 1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length = 0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels = bquote( hat(beta) ~ "+" ~ .(abs(round(qnorm(a/2),digits=2))) %*% se(hat(beta)) ~ " = " ~ .(b+cvr*sd)), cex = 1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length = 0.08)


# beta hat
text(b-0.1, maxh*0.2, labels = bquote(hat(beta) ~ "=" ~ .(b)), cex = 1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length = 0.08) 

text(-1.0, 0.35, labels = expression("Estimated distribution of" ~ hat(beta)),cex = 1)
arrows(-1.1, 0.33, 1.11, 0.2, length = 0.08)

Figure 3.7: Confidence Interval

3.7.1 Interpretation of the confidence interval

  • A confidence interval is constructed in the way that in repeated samples it will cover the population regression coefficient (the true value) in 95% of the cases

  • Hence, the bounds of a confidence interval are random

  • We further have an important relationship between confidence intervals and hypotheses tests:

a_j \notin interval \ \Rightarrow \ \text{ reject } H_0: \beta_j=a_j \, \text{ in favour of } H_1: \beta_j \neq a_j

  • This means that if a specific value is not in the confidence interval, we can reject the null hypothesis in a two-tailed test that \beta_j is equal to this value

  • Hence, calculating confidence intervals is an alternative way to carry out parameter tests

    • Likewise, this equivalence to parameter tests implies that violations of assumptions MLR.1 – MLR.5 (MLR.6) may lead to invalid confidence intervals

Simulating Confidence Intervals

To demonstrate the interpretation of confidence intervals from above, we simulate the following simple regression model

y = 3 + 1 \times x + u

with a sample size of n=100, and calculate the estimate for \beta_1 – with a true value of one – and the corresponding estimated 95% confidence interval with

lower \ bound \ = \ \hat \beta_1 - q_{0.975} \times se(\hat \beta_1) \\ \ upper \ bound \ = \ \hat \beta_1 + \underbrace {q_{0.975}}_{1.984} \times se(\hat \beta_1)

This whole procedure is repeated 500 times. We plot the first 100 of the calculated 500 confidence intervals and look how many of them cover the true value of \beta_1=1

  • Result: Of all 500 simulated intervals, 94.4% cover the true value of one. Of the first 100 plotted, 95 cover the true value and 5 do not. This is in nearly perfect accordance with expectations
Code
library(AER) # we need coeftest()
set.seed(1)  # set seed

rep <- 500  # number of calculated confidence intervals
n <- 100    # sample size of regressions
nrep <- 100 # first nrep confidence intervals of the simulated rep, which are plotted  

# initialize vectors of lower and upper interval boundaries
lower <- numeric(rep)
upper <- numeric(rep)

# loop of running regressions / estimation / CI
for(i in 1:rep) {
  # model
  x <- rnorm(n, mean=5, sd=5)
  u <- rnorm(n)
  y = 3 + x + u
  
  # regression
  ols <- lm(y ~ x)
  
  # CIs
  ctest <-  coeftest(ols)
  lower[i] <- ctest[2,1] - qt(1-0.05/2, n) * ctest[2,2]  
  upper[i] <- ctest[2,1] + qt(1-0.05/2, n) * ctest[2,2]
}
Code
# merge vectors of lower and upper confidence interval bounds into a matrix with two columns
CIs <- cbind(lower, upper)

# fraction of how many confidence intervals cover true value of 1
print( paste( mean( CIs[, 1] <= 1 & 1 <= CIs[, 2] )*100, "% of the simulated confidence intervals cover the true value of 1" ) )
[1] "94.4 % of the simulated confidence intervals cover the true value of 1"
Code
# identify intervals not covering true value of 1 for the first nreps cases
ID <- which( !(CIs[1:nrep, 1] <= 1 & 1 <= CIs[1:nrep, 2]) )
print( paste("The following CI does not cover the true value:", ID) )
[1] "The following CI does not cover the true value: 5" 
[2] "The following CI does not cover the true value: 14"
[3] "The following CI does not cover the true value: 23"
[4] "The following CI does not cover the true value: 27"
[5] "The following CI does not cover the true value: 66"
Code
# initialize the plot for first 100 CIs
plot(0, 
     xlim = c(1, nrep),
     ylim = c( 1 - sd(CIs[,1])*4.3, 1 + sd(CIs[,2])*4.3), 
     xlab = "Confidence intervals in repeated random samples", 
     ylab = expression(beta[1]~"="~1), 
     main = bquote("Simulating" ~ .(nrep) ~ "confidence intervals;" ~ 
                     .(length(ID)) ~ "out of" ~ .(nrep) ~ "do not cover the true value 1") )

# set up color vector
colors <- rep( "grey60", nrep) # color grey for all 100 CIs 
colors[ID] <- "salmon"         # color salmon for CIs not covering the true value

# draw reference line at beta1=1
abline(h = 1)

# add first nrep=100 confidence intervals with corresponding x,y coordinates 
for(j in 1:nrep) {
  lines(x = c(j,j), y = c(CIs[j, 1], CIs[j, 2]),  
        col = colors[j], 
        lwd = 3.6)
}

Figure 3.8: Simulated confidence intervals

3.7.2 Confidence Intervals – Example

library(AER);  library(wooldridge);  data("wage1")

# we once more estimate our wage model
summary(  out <- lm(lwage ~ educ + tenure + exper, data = wage1)  )
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

# Now we calculate the 95% confidence intervals for all parameters with the R function confint() 
confint(out)
                         2.5 %     97.5 %
      (Intercept) 0.0796755675 0.48904351
      educ        0.0776292151 0.10642876
      tenure      0.0159896854 0.02814475
      exper       0.0007356984 0.00750652

As we can see, 0.1 is contained in the 95% CI for educ. Therefore, a 10% effect of one more year of education cannot be ruled out at a significance level of 5%. This is the same result as we obtained earlier with the corresponding t-test

# Calculating the 99% confidence intervals for all parameters
confint(out, level = 0.99)
                          0.5 %      99.5 %
      (Intercept)  0.0149981868 0.553720895
      educ         0.0730790806 0.110978896
      tenure       0.0140692670 0.030065169
      exper       -0.0003340459 0.008576264

3.8 Testing multiple linear restrictions

To give a motivation for this section it is best to start with an example

  • Suppose we want to investigate whether the salaries of baseball players depend on their achievements

\begin{align*} log(salary) = \beta_0 &+ \beta_1\cdot years + \beta_2 \cdot gamesyr \\ &+ \beta_3 \cdot bavg + \beta_4\cdot hrunsyr + \beta_5\cdot rbisyr + u \end{align*}

  • years = number of years as an active player

  • gamesyr = number of games played per year

  • bavg = batting average

  • hrunsyr = homeruns per year

  • rbisyr = runs batted in per year

  • Data set mlb1 from the Wooldridge package, n = 353 major league baseball players

  • The last three variables are measures of performance. We want to test whether the salaries of the players depend on these, given the years as an active player and the number of games played per year

  • Our null hypothesis is H_0: \beta_3=0,\ \beta_4=0, \ \beta_5=0 against the H_1: at least one parameter ≠ 0

  • As we have to deal with multiple restrictions in this case, our usual t-test is not applicable any more


  • One possible general approach is as follows – Wald or F Tests

    • We first estimate the model without imposing the restrictions; this is the unrestricted version

    • After that, we estimate the model again, this time enforcing the restrictions; this is the restricted version (actually, this step is not really required for a Wald test, but it is easier to explain the test idea this way)

      • In the case of zero restrictions, like \beta_j=0, \, \ldots, (called exclusion restrictions), this is accomplished very easily by simply dropping the corresponding variables from the model

      • In the case of more complicated restrictions involving several parameters, like \beta_i - \beta_j=0 or \beta_i+\beta_j=1, a reparameterization of the model is necessary: For the first case we could define a new parameter \theta=\beta_i - \beta_j and plug in for \beta_i=\theta+\beta_j into the model and collect the variables accordingly. Afterwards, we drop the variable corresponding to \theta

  • Clearly, imposing restrictions on the model leads to a poorer fit. The test idea is that the H_0 has to be rejected if the fit of the model worsens too much. In this case, the data, along with the model of course, are apparently not compatible with the null hypothesis

  • Hence, we are looking for a test statistic which appropriately measures the decrease in the model’s fit caused by the imposition of the restrictions and from which we know the distribution


The F-statistic

  • The following test statistic does the job

F \ = \ \frac{\left(SSR_r-SSR_{ur} \right) / m}{SSR_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.8}

Thereby SSR \equiv \sum\nolimits _{i=1}^n \hat u_i^2 is defined for both the restricted and unrestricted model

  • This statistic measures the percentage decrease in the fit (the percentage increase in SSR), corrected for the degrees of freedom of the numerator (m, the number of restrictions) and of the denominator (n-k-1, the degrees of freedom used in calculating \hat \sigma_{ur}^2)

  • There is an equivalent version of this test statistic in terms of the R^2 instead of SSR. But this version is only applicable if the dependent variable is the same for both the restricted and unrestricted model

F \ = \ \frac{(R_{ur}^2 - R_{r}^2) / m}{(1-R_{ur}^2) / (n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.9}

  • This measures the decrease in the model’s fit in terms of a decrease in the R^2

  • Note, for both versions the numerator is always positive, as a (binding) restriction always reduces the model’s fit

  • One can show that the test statistic introduced above is F distributed with m and (n-k-1) degrees of freedom, as both the numerator and the denominator are (scaled) \chi^2 distributed random variables, which is the definition of an F-distribution. This test statistic is simply called F-statistic

  • Important Remark: The t-statistic introduced above is a special case of the F-statistic if we only test one restriction, i.e., m=1. Actually, in this case, the F-statistic is simply the square of the t-statistic. Therefore, any (two-tailed) t-test can also be formulated as an F-test
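
A brief check of this remark (a sketch assuming the wage model out from Section 3.7.2 is still in memory): testing the single exclusion restriction for tenure with lht() yields an F-statistic that equals the squared t-statistic

library(AER)   # provides lht()
tstat <- coef( summary(out) )["tenure", "t value"]
tstat^2                  # approx. 50.9
lht(out, "tenure = 0")   # the reported F-statistic matches tstat^2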


Wald statistic

  • As the denominator of the F-statistic is simply the estimate of \sigma_{ur}^2, which can be consistently estimated by OLS, there is an asymptotic version of the F-statistic, which we call W. Thereby, \hat \sigma_{ur}^2 in F is replaced by \sigma_{ur}^2 (assuming the number of observations is large enough so that the estimate for \sigma_{ur}^2 has already converged to the true value) and F is further multiplied by m

W \ = \ \frac{\left(SSR_r-SSR_{ur} \right)}{\sigma_{ur}^2} \ \sim \ \chi^2(m) \tag{3.10}

  • W, which is called the Wald statistic, is \chi^2 distributed because the denominator is no longer a random variable but simply a scaling constant, required so that W is actually \chi^2 distributed

  • Many econometric software packages let the user choose whether the F or \chi^2 version of the test should be used


3.8.1 Carrying out the test

    • Carrying out an F-test is basically the same procedure as with a t-test

    • We postulate a null hypothesis

    • We determine the critical value as the (1-\alpha) quantile, e.g., the 0.95 quantile, of the corresponding F or \chi^2 distribution

    • We calculate the test statistic and check whether it falls into the rejection region, i.e., whether the test statistic is larger than the critical value

    • Alternatively, we can calculate the p-value corresponding to the test statistic

  • The figures below illustrate this procedure

    • Note that the F and \chi^2 distributions are very similar in shape in this case because the denominator’s degrees of freedom are quite high (347). The numerator’s degrees of freedom are 3 – we test three restrictions

      • Both test statistics correspond to the baseball player example presented above. Further note the different scaling of the two distributions which amounts to 3

      • As shown after the figures, the F-statistic and W-statistic for the baseball player example are 9.55 and 28.65, respectively. Is the H_0 rejected?


Code
# Plot the F(3,100) Distribution on interval [-5,5]
df1 <- 3
df2 <- 353-6
end <- 5
a <- 0.05
hmax <- df( qf(0.3, df1, df2), df1, df2)  # not really hmax
t <- seq(0, 5, 0.01)
cv <- round( qf(1-a, df1, df2), digits = 2 )

plot(x = t, y = df(t, df1, df2),  type = "l",  col = "black",  lwd = 2, 
     main = bquote("Test with F(" ~ .(df1) ~ ", " ~ .(df2) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=0:end, pos = c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x = c(cv, seq(cv, 5, 0.01), 5),
        y = c( 0, df(seq(cv, 5, 0.01), df1, df2), 0 ), 
        col = 'cornflowerblue')




text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)),cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)

text(end*0.15, hmax/2.8, labels = bquote( qf[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08) 

Figure 3.9: F-distribution

Code
# Plot the ChiSq(3) Distribution on interval [-5,5]
df <- 3
a <- 0.05
end <- df*5
hmax <- dchisq( qchisq(0.3, df), df)  # not really hmax
t <- seq(0, end, 0.01)
cv <- round( qchisq(1-a, df), digits = 2 )

plot(x = t, y = dchisq(t, df),  type = "l",  col = "black",  lwd = 2, 
     main = bquote("Test with ChiSq(" ~ .(df) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab="", xlab="",  cex.lab = 1,  cex.main = 1.5,  axes=FALSE)
axis(side=1, at=0:end, pos = c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x = c(cv, seq(cv, end, 0.01), end),
        y = c( 0, dchisq(seq(cv, end, 0.01), df), 0 ), 
        col = 'cornflowerblue')

text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)),cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)

text(end*0.15, hmax/2.8, labels = bquote( qchisq[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08) 

Figure 3.10: Chi-squared distribution

3.8.2 Baseball example continued

We estimate the baseball player model; unrestricted version
library(AER); library(wooldridge); data("mlb1")

unr <- lm( log(salary) ~  years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(unr)
      
      Call:
      lm(formula = log(salary) ~ years + gamesyr + bavg + hrunsyr + 
          rbisyr, data = mlb1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -3.02508 -0.45034 -0.04013  0.47014  2.68924 
      
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 1.119e+01  2.888e-01  38.752  < 2e-16 ***
      years       6.886e-02  1.211e-02   5.684 2.79e-08 ***
      gamesyr     1.255e-02  2.647e-03   4.742 3.09e-06 ***
      bavg        9.786e-04  1.104e-03   0.887    0.376    
      hrunsyr     1.443e-02  1.606e-02   0.899    0.369    
      rbisyr      1.077e-02  7.175e-03   1.500    0.134    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.7266 on 347 degrees of freedom
      Multiple R-squared:  0.6278,  Adjusted R-squared:  0.6224 
      F-statistic: 117.1 on 5 and 347 DF,  p-value: < 2.2e-16
  • Note that none of the performance variables is statistically significant

We estimate the baseball player model, this time in its restricted version
r <- lm( log(salary) ~  years + gamesyr, data = mlb1)
summary(r)
      
      Call:
      lm(formula = log(salary) ~ years + gamesyr, data = mlb1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.66858 -0.46412 -0.01177  0.49219  2.68829 
      
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 11.223804   0.108312 103.625  < 2e-16 ***
      years        0.071318   0.012505   5.703  2.5e-08 ***
      gamesyr      0.020174   0.001343  15.023  < 2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.7527 on 350 degrees of freedom
      Multiple R-squared:  0.5971,  Adjusted R-squared:  0.5948 
      F-statistic: 259.3 on 2 and 350 DF,  p-value: < 2.2e-16

We calculate the F-statistic from Equation 3.8
SSRur <- sum( unr$residuals^2 )
SSRr <- sum( r$residuals^2 )
df <- summary(unr)$df[2]

Fstat <- (SSRr - SSRur)/SSRur*df/3
Fstat
      [1] 9.550254
# 95% critical value
qf(0.95, 3, df)
      [1] 2.630641
# p-value
1 - pf(Fstat, 3, df)
      [1] 4.473708e-06
  • The H_0 that none of the performance variables has any impact on the salary of the players is strongly rejected, although the variables individually are not significant – a multicollinearity problem

  • Compare Figure 3.9


We carry out the same test with lht() (for linearHypothesis), contained in the AER package
  • Applying this procedure, we notice that it is actually not necessary to estimate the restricted model to carry out the test. We only need the unrestricted model, as was the case with the simple t-test
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"))
      Linear hypothesis test
      
      Hypothesis:
      bavg = 0
      hrunsyr = 0
      rbisyr = 0
      
      Model 1: restricted model
      Model 2: log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr
      
        Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
      1    350 198.31                                  
      2    347 183.19  3    15.125 9.5503 4.474e-06 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We get the very same results as with the F-test carried out by hand

We calculate the W-statistic from Equation 3.10; the chi squared version
sig2ur <- SSRur/df 

Wstat <-  (SSRr - SSRur)/sig2ur
Wstat
      [1] 28.65076
# 95% critical value
qchisq(0.95,3)
      [1] 7.814728
# p-value
1 - pchisq(Wstat, 3)
      [1] 2.651604e-06

The same test with lht(), contained in the AER package; chi squared version
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"), test = "Chisq")
      Linear hypothesis test
      
      Hypothesis:
      bavg = 0
      hrunsyr = 0
      rbisyr = 0
      
      Model 1: restricted model
      Model 2: log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr
      
        Res.Df    RSS Df Sum of Sq  Chisq Pr(>Chisq)    
      1    350 198.31                                   
      2    347 183.19  3    15.125 28.651  2.652e-06 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We get the same result as when doing the test by hand and nearly the same result as with the F-statistic version

  • Compare Figure 3.10


3.8.3 “the” F-statistic of a regression model

A special case of the F-test is a test of the overall significance of a regression with the

H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0

This null hypothesis states that the explanatory variables overall are not useful in explaining the dependent variable

  • This test of overall significance is routinely calculated and reported in most regression packages as “the” F-statistic; the null hypothesis is usually overwhelmingly rejected

  • If this H_0 cannot be rejected, the model is of no value in explaining the dependent variable
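
As a sketch (assuming the wage model out from Section 3.7.2 is still in memory), “the” F-statistic reported by summary() can be reproduced with lht() by jointly testing all slope coefficients against zero

library(AER)   # provides lht()
lht( out, c("educ = 0", "tenure = 0", "exper = 0") )   # F approx. 80.39, as reported by summary(out)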


3.8.4 Additional examples

As a last example in this chapter we will re-estimate our wage equation

log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u

  • We will test whether the effect of one additional year of education is three times as large as the effect of one additional year of tenure; H_0: \beta_1 = 3 \beta_3

    • As this hypothesis involves only one restriction, a t-test could be used. However, the restriction affects two parameters, therefore, a reparameterization of the model with a variable transformation would be necessary to implement this restriction

    • Using the R function lht() (for linearHypothesis) of the car and AER package implements this test very easily


library(AER);  library(wooldridge);  data("wage1")

# we first estimate the model
summary(  out <- lm(lwage ~ educ + tenure + exper, data = wage1)  )
      
      Call:
      lm(formula = lwage ~ educ + tenure + exper, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -2.05802 -0.29645 -0.03265  0.28788  1.42809 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
      educ        0.092029   0.007330  12.555  < 2e-16 ***
      tenure      0.022067   0.003094   7.133 3.29e-12 ***
      exper       0.004121   0.001723   2.391  0.01714 *  
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4409 on 522 degrees of freedom
      Multiple R-squared:  0.316,   Adjusted R-squared:  0.3121 
      F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16
  • Note “the” F-statistic!

  • We test the H0: beta1 = 3 beta3
lht( out, "educ = 3*tenure") 
      Linear hypothesis test
      
      Hypothesis:
      educ - 3 tenure = 0
      
      Model 1: restricted model
      Model 2: lwage ~ educ + tenure + exper
      
        Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
      1    523 102.29                              
      2    522 101.46  1   0.83518 4.2971 0.03867 *
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • We additionally test whether beta1 = 0.1

  • If we have several restrictions, the restrictions must be put into a character vector with the ‘c()’ function

lht( out, c("educ = 3*tenure", "educ=0.1") )
      Linear hypothesis test
      
      Hypothesis:
      educ - 3 tenure = 0
      educ = 0.1
      
      Model 1: restricted model
      Model 2: lwage ~ educ + tenure + exper
      
        Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
      1    524 104.47                                  
      2    522 101.46  2    3.0199 7.7688 0.0004735 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Imposing these two restrictions together is clearly not compatible with the data anymore

3.9 Tests in matrix notation

Wald test statistic with general linear hypothesis in matrix notation

Every set of linear hypotheses can be stated as:

\mathbf R \boldsymbol{\hat{\beta}} = \mathbf c \tag{3.11}

Thereby, the restriction matrix \mathbf R has dimensions m x (k+1), with m being the number of restrictions. Consider the following example:

\underbrace{\left[\begin{array}{cccc} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right]}_{R}\left[\begin{array}{c} \hat{\beta}_{0} \\ \hat{\beta}_{1} \\ \hat{\beta}_{2} \\ \hat{\beta}_{3} \end{array}\right] = \underbrace{\left[\begin{array}{c} 0 \\ 0 \end{array}\right]}_{c}

Multiplying out we see that, in this case, we test whether \hat \beta_2=0 and \hat \beta_3=0.

We have the following important Theorem:

Theorem 3.3 (Distribution of a quadratic form) Let \mathbf z be a N(\mathbf 0, \mathbf V) normally distributed random vector with h elements. Then the quadratic form

\mathbf z' \mathbf V^{-1} \mathbf z

is distributed as

\mathbf z' \mathbf V^{-1} \mathbf z \ \sim \ \chi^2(h) (see Appendix A.6 and A.9).


  • Assuming that \sigma^2 is a known constant and applying the covariance formula from Equation C.7 to the restriction vector (R \boldsymbol{\hat{\beta}} - \mathbf c), we get the covariance matrix of this random vector:

\sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R' \tag{3.12}

  • As \boldsymbol{\hat \beta} is asymptotically normally distributed (or normally distributed in small samples if \mathbf u is normally distributed), provided H_0 is true, the restriction vector is

(R \boldsymbol{\hat{\beta}} - \mathbf c) \ \sim \ N(\mathbf 0, \sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R')

  • Applying Theorem 3.3 from above we conclude that the following must hold:

W \ = \ \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})^{\prime}}_{z^{\prime}} \underbrace{\left(\sigma^{2} \mathbf{R}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{R}^{\prime}\right)^{-1}}_{V^{-1}} \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})}_{z} \ \sim \ \chi^{2}(m) \tag{3.13}

This Wald test statistic is a weighted sum (by the inverse of the covariance matrix) of the distance between the hypothesized values \mathbf c and the actual estimation outcomes \mathbf R \boldsymbol{\hat \beta}. If this normed distance is too large, H_0 is rejected.

For m = 1 (only one restriction), \sqrt W reduces to a simple t-statistic.


If \sigma^2 is unknown, it has to be estimated by \hat \sigma^2={\sum_{i=1}^n \hat u_i^2} \ / \ {(n-k-1)}.

  • With the help of the above theorem, it can be shown that this estimator (after scaling with \sigma^2) is distributed as \chi^2(n-k-1)

  • Hence, in this case, the Wald statistic in Equation 3.13 is the ratio of two independently \chi^2 distributed random variables and therefore is F-distributed after appropriate scaling (see Appendix A.7):

F \ = \ \frac{(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)' \left(\hat\sigma^{2} \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'\right)^{-1}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)}{m} \ \sim \ F(m,n-k-1) \tag{3.14}

Note, in small samples this formula is valid only if \mathbf u is normally distributed. In large samples, Equation 3.14 times m converges to Equation 3.13 and so the two test statistics are asymptotically equivalent.

  • Furthermore, it can be shown that the formula above is equivalent to:

F \ = \ \frac{\left(\hat{\mathbf{u}}_{r}^{\prime} \hat{\mathbf{u}}_{r}-\hat{\mathbf{u}}_{ur}' \hat{\mathbf{u}}_{ur} \right) / m}{\hat{\mathbf{u}}'_{ur} \hat{\mathbf{u}}_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1) \tag{3.15}

The interpretation of this test-statistic is straightforward: If the percentage decrease of the fit (measured by SSR) of the restricted model is too large compared to the model without imposing the restrictions, the restricted model is rejected.
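
As a closing sketch (assuming the unrestricted baseball model unr from Section 3.8.2 is still in memory), the Wald statistic in its matrix form can be computed directly; since vcov() uses the OLS estimate \hat \sigma_{ur}^2, the quadratic form reproduces the value of about 28.65 obtained earlier

R <- matrix(0, nrow = 3, ncol = 6)           # m = 3 restrictions, k+1 = 6 coefficients
R[1, 4] <- 1;  R[2, 5] <- 1;  R[3, 6] <- 1   # pick out bavg, hrunsyr and rbisyr
c0 <- rep(0, 3)                              # hypothesized values

b <- coef(unr)
W <- drop( t(R %*% b - c0) %*% solve( R %*% vcov(unr) %*% t(R) ) %*% (R %*% b - c0) )
W                  # approx. 28.65, the Wald statistic from Section 3.8.2
1 - pchisq(W, 3)   # corresponding p-value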