This chapter deals with statistical inference in the regression model, particularly with
Hypothesis tests about population parameters
Construction of confidence intervals
For both we need the distribution of the OLS estimators for repeated samples
The OLS estimators \hat \beta_j and \hat \sigma^2 depend on the particular sample and are therefore random variables
We already know their expected values and their variances
However, for hypothesis tests we need to know their distribution
In order to derive their distribution we need additional assumptions, at least for small samples
MLR.6: Errors are normally distributed
u_i \ \sim \ N(0, \sigma^2)
which implies that the dependent variable y is normally distributed as well, conditional on the values of the explanatory variables in \mathbf x
The error term u_i is interpreted as the sum of many different unobserved factors
According to the Central Limit Theorem (CLT, see Theorem A.3), sums (means) of many independent factors tend to be normally distributed (a small simulation sketch follows the list of caveats below)
Problems:
Are the conditions for the CLT met? In particular:
How many different factors? Is the number large enough?
Possibly very heterogeneous distributions of individual factors
How independent are the different factors from each other?
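To get a feeling for the CLT argument, here is a minimal simulation sketch (all numbers are chosen purely for illustration and are not part of the chapter's examples): each simulated "error term" is the sum of many heterogeneous, independent factors, and the resulting distribution is compared with a normal density.

```r
# Illustrative only: each simulated "error term" is the sum of 40 heterogeneous,
# independent factors (uniform plus centered Bernoulli components)
set.seed(123)
n_factors <- 40
n_obs <- 10000
u <- replicate(n_obs, sum(runif(n_factors, -1, 1) + rbinom(n_factors, 1, 0.3) - 0.3))
hist(u, breaks = 50, freq = FALSE, main = "Sum of many heterogeneous factors")
curve(dnorm(x, mean(u), sd(u)), add = TRUE, lwd = 2)   # normal density for comparison
```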
The normality of the error term is an empirical question
At least the error distribution should be “close” to normal
In many cases, normality is questionable or even impossible by definition
Examples where normality cannot strictly hold:
Wages (non-negative; also: minimum wage)
Number of arrests (takes on a small number of integer values)
Employed/not employed (indicator, respectively, binary dependent variable, takes on only 1 or 0)
In some cases, normality can be achieved through transformations of the dependent variable (e.g. use log(wage) instead of wage)
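As a quick descriptive check of this point with the wage1 data used later in this chapter (this sketch is not part of the formal argument), one can compare the distribution of wage with that of log(wage):

```r
library(wooldridge)
data("wage1")
par(mfrow = c(1, 2))
hist(wage1$wage, breaks = 30, main = "wage", xlab = "")
hist(log(wage1$wage), breaks = 30, main = "log(wage)", xlab = "")   # much closer to normal
par(mfrow = c(1, 1))
```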
Very important: Under certain regularity conditions regarding the data and u_i, the assumption of normality can be replaced by a large sample size. In that case the error term need not be normally distributed.
The following terminology is often used:
\underbrace{\text{MLR.1 - MLR.5}}_{\text{Gauss-Markov assumptions}} \quad \quad \underbrace{\text{MLR.1 - MLR.6}}_{\text{Classical linear model (CLM) assumptions}}
Furthermore, we have the following important theorem:
Theorem 3.1 (Normal sampling distributions) Under assumptions MLR.1 - MLR.6
\hat \beta_j \ \sim \ N \left( \beta_j, \operatorname{Var}(\hat \beta_j) \right) \tag{3.1}
That means that, given the assumptions above, the OLS estimator \hat \beta_j is unbiased and normally distributed around the true parameter with a variance that was derived earlier. (The proof uses the representation \hat {\boldsymbol \beta} = \boldsymbol \beta + (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf u: the estimator is a linear combination of the normally distributed errors and is therefore normally distributed itself.) Standardizing with the true standard deviation yields
\dfrac {\hat \beta_j - \beta_j}{sd(\hat \beta_j)} \ \sim \ N(0,1) \tag{3.2}
Theorem 3.2 (t-distribution for the standardized estimators) Under assumptions MLR.1 - MLR.6
\dfrac {\hat \beta_j - \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.3}
That means that if the standardization in Equation 3.2 is done using the estimated standard deviation of \hat \beta_j, the normal distribution from Equation 3.2 is replaced by a t-distribution
The estimated standard deviation of \hat \beta_j is called the standard error
The standard error, se(\hat \beta_j), is calculated with \hat \sigma^2 instead of the true \sigma^2 in Equation 2.37. Therefore, se(\hat \beta_j) is itself a random variable; its square is, after appropriate scaling, \chi^2 distributed
In this case, we actually have a ratio of two random variables in Equation 3.3 – a standard normal over the square root of a (scaled) \chi^2 distributed random variable, which leads to a t-distribution. That’s our heuristic proof of Theorem 3.2
A very important special case arises if we test the particular null hypothesis H_0 : \beta_j=0
This means that we hypothesize that the true population parameter is equal to zero, i.e. there is no effect of x_j on y after controlling for the other independent variables
This is the so-called Significance Test. In this case, presuming the H_0 is true, Equation 3.3 reduces to the t-statistic
t_{\hat \beta_j} \ \equiv \ \dfrac {\hat \beta_j - 0}{se(\hat \beta_j)} \ = \ \dfrac {\hat \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.4}
If we, more generally, hypothesize that \beta_j = a_j, the test statistic is (presuming the H_0 is true)
\dfrac {\hat \beta_j - a_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} \tag{3.5}
Generally, the t-statistic is a standardized measure of the distance between the estimated parameter and its hypothesized value (which is zero or a_j)
In particular, it measures how many estimated standard deviations the estimated coefficient is away from the value under the H_0 (0 or a_j)
The basic test idea is the following: The farther the estimated coefficient is away from the hypothesized value zero (or a_j), the less likely it is that the null hypothesis holds true
But what does “far away” from zero (or a_j) mean? This is measured by the t-statistic above and assessed with the help of the distribution of the t-statistic
If the H_0 is true, we know from Theorem 3.2 that the distribution of the t-statistic is the t-distribution (so the name “t-statistic” is no accident)
If the degrees of freedom of the t-distribution are greater than 60 (many observations), the t-distribution is practically identical to the standard normal distribution N(0,1)
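This convergence is easy to check numerically; the following short sketch compares the 0.975 quantile of the t-distribution for several degrees of freedom with the corresponding standard normal quantile:

```r
df <- c(5, 10, 30, 60, 120, 500)
round(qt(0.975, df), 3)   # t quantiles: 2.571 2.228 2.042 2.000 1.980 1.965
qnorm(0.975)              # standard normal quantile: about 1.96
```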
3.3.1 Testing procedure for a single parameter
How is a test procedure actually carried out?
From Theorem 3.2, we know the distribution of the test statistic
We carry out the estimation of \hat \beta_j and se(\hat \beta_j) and calculate the test statistic in Equation 3.5
We want to test a hypothesis H_0 about the true value of the parameter – actually we want to reject the H_0
The test statistic shows how far away the estimate \hat \beta_j is from the hypothesized value
To assess whether the test statistic is far enough away to reject the H_0, we construct a rejection region so that, if the null hypothesis is true, the test statistic only accidentally falls into this rejection region, for instance in only 5% of the cases. This small probability is freely chosen and called the significance level
The significance level \alpha is the probability of making a so-called type I error or \alpha-error, which is the probability of falsely rejecting a true hypothesis (only due to sampling errors). In most cases \alpha=0.05 (5%) is chosen, but \alpha=0.01 (1%) or \alpha=0.1 (10%) are also quite common
The boundary points of the rejection region are called critical values. These are the values that are exceeded (or fallen below) with probability \alpha if the H_0 is true. Formally, we are looking for the corresponding quantiles of the test statistic’s distribution
Having constructed the rejection region of the test, we reject the null hypothesis H_0: \beta_j=a_j in favor of the alternative hypothesis if the test statistic falls into this rejection region. In this case the estimated value is “too far away” from the hypothesized value
There are two types of alternative hypothesis H_1:
H_1: \beta_j \neq a_j for a two sided or two-tailed test
either H_1: \beta_j>a_j or H_1: \beta_j<a_j for a one sided or one-tailed test
For one-sided tests, it is sometimes not obvious whether a left- or a right-sided test applies. There is a simple rule: if the t-statistic is negative, only a left-sided test can be applied, and if the t-statistic is positive, only a right-sided test. Otherwise, the H_0 could never be rejected
Two-tailed tests are much more common. One-tailed tests are used if the researcher is only interested in a particular “direction” of the parameter, e.g., with a unit-root test (a particular test for the stationarity of time series) one is only keen on whether the parameter of interest might be \geq 0
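The whole procedure can be summarized in a few lines of R; the numbers below (estimate, standard error, degrees of freedom) are purely hypothetical placeholders, not results from this chapter:

```r
beta_hat <- 0.22   # hypothetical estimated coefficient
se_hat   <- 0.08   # hypothetical standard error
a_j      <- 0      # hypothesized value under H0
df       <- 100    # degrees of freedom n - k - 1 (assumed)
alpha    <- 0.05   # chosen significance level

t_stat <- (beta_hat - a_j) / se_hat   # t-statistic, Equation 3.5
cv     <- qt(1 - alpha/2, df)         # critical value for a two-tailed test
abs(t_stat) > cv                      # TRUE => reject H0
```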
```r
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
cvl <- round(qnorm(0.95), digits = 2)   # critical value as quantile of the norm. distrib.
plot(x = t, y = dnorm(t), type = "l", col = "black", lwd = 2,
     main = expression("One-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]:~ beta ~ "≤ 0, H"[1]:~ beta ~ " > 0"),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = -5:5, pos = c(0, 0))
axis(side = 2)
# critical region in right tail
polygon(x = c(cvl, seq(cvl, 5, 0.01), 5),
        y = c(0, dnorm(seq(cvl, 5, 0.01)), 0), col = 'cornflowerblue')
# text
text(3.5, 0.15, labels = expression("Rejection region," ~ alpha ~ "= 0.05"), cex = 1)
arrows(3.6, 0.133, 2.3, 0.04, length = 0.08)
text(0, 0.1, labels = bquote(qn[0.95] ~ " = " ~ .(cvl)), cex = 1)
arrows(0, 0.085, cvl - 0.03, 0.002, length = 0.08)
text(-3.5, 0.3, labels = expression("Distribution of the test statistic"), cex = 1)
arrows(-3.6, 0.28, -1.22, 0.2, length = 0.08)
```
Figure 3.1: Distribution of the t-statistic - in this case a normal distribution
```r
# Plot the normal distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
# lower and upper critical values
cvl <- round(qnorm(0.025), digits = 2); cvr <- round(qnorm(0.975), digits = 2)
plot(x = t, y = dnorm(t), type = "l", col = "black", lwd = 2,
     main = expression("Two-tailed test with normal distrib. and" ~ alpha ~ "= 0.05; e.g., H"[0]:~ beta ~ "= 0, H"[1]:~ beta ~ " ≠ 0"),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = -5:5, pos = c(0, 0))
axis(side = 2)
# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c(0, dnorm(seq(-5, cvl, 0.01)), 0), col = 'cornflowerblue')
# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c(0, dnorm(seq(cvr, 5, 0.01)), 0), col = 'cornflowerblue')
text(-4, 0.15, labels = expression("0.025" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(-4, 0.13, -2.5, 0.026, length = 0.08)
text(4, 0.15, labels = expression("0.025" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(4, 0.13, 2.5, 0.026, length = 0.08)
text(-0.5, 0.15, labels = bquote(qn[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-0.5, 0.135, cvl + 0.03, 0.002, length = 0.08)
text(0.5, 0.1, labels = bquote(qn[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.5, 0.085, cvr - 0.03, 0.002, length = 0.08)
```
Figure 3.2: Distribution of the t-statistic - in this case a normal distribution
```r
# Plot the t(4) distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
# lower and upper critical values
cvl <- round(qt(0.025, 4), digits = 2); cvr <- round(qt(0.975, 4), digits = 2)
plot(x = t, y = dt(t, 4), type = "l", col = "black", lwd = 2,
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.05; e.g., H"[0]:~ beta ~ "= 0, H"[1]:~ beta ~ " ≠ 0"),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = -5:5, pos = c(0, 0))
axis(side = 2)
# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c(0, dt(seq(-5, cvl, 0.01), 4), 0), col = 'cornflowerblue')
# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c(0, dt(seq(cvr, 5, 0.01), 4), 0), col = 'cornflowerblue')
text(-4.5, 0.15, labels = expression("0.025" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(-4.5, 0.13, -3.8, 0.02, length = 0.08)
text(4.5, 0.15, labels = expression("0.025" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(4.5, 0.13, 3.8, 0.02, length = 0.08)
text(-2.2, 0.18, labels = bquote(qt[0.025] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl + 0.03, 0.0015, length = 0.08)
text(0.55, 0.09, labels = bquote(qt[0.975] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr - 0.03, 0.002, length = 0.08)
```
Figure 3.3: Distribution of the t-statistic - in this case a t-distribution
```r
# Plot the t(4) distribution on interval [-5,5]
t <- seq(-5, 5, 0.01)
# lower and upper critical values
cvl <- round(qt(0.01, 4), digits = 2); cvr <- round(qt(0.99, 4), digits = 2)
plot(x = t, y = dt(t, 4), type = "l", col = "black", lwd = 2,
     main = expression("Two-tailed test with t(4) distribution and" ~ alpha ~ "= 0.02; e.g., H"[0]:~ beta ~ "= 0, H"[1]:~ beta ~ " ≠ 0"),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = -5:5, pos = c(0, 0))
axis(side = 2)
# critical region in left tail
polygon(x = c(-5, seq(-5, cvl, 0.01), cvl),
        y = c(0, dt(seq(-5, cvl, 0.01), 4), 0), col = 'cornflowerblue')
# critical region in right tail
polygon(x = c(cvr, seq(cvr, 5, 0.01), 5),
        y = c(0, dt(seq(cvr, 5, 0.01), 4), 0), col = 'cornflowerblue')
text(-4.5, 0.15, labels = expression("0.01" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(-4.5, 0.13, -4.1, 0.015, length = 0.08)
text(4.5, 0.15, labels = expression("0.01" ~ "=" ~ over(alpha, 2)), cex = 1)
arrows(4.5, 0.13, 4.1, 0.015, length = 0.08)
text(-2.2, 0.18, labels = bquote(qt[0.01] ~ " = " ~ .(cvl)), cex = 1)
arrows(-2.2, 0.165, cvl + 0.03, 0.0015, length = 0.08)
text(0.55, 0.09, labels = bquote(qt[0.99] ~ " = " ~ .(cvr)), cex = 1)
arrows(0.65, 0.075, cvr - 0.03, 0.002, length = 0.08)
```
Figure 3.4: Distribution of the t-statistic - in this case a t-distribution
3.3.2 Example for tests - wage equation
We first estimate the model – in this case explaining individuals' wages

```r
library(AER); library(wooldridge); data("wage1")
output <- summary(lm(lwage ~ educ + tenure + exper, data = wage1))
output
```

We want to test whether tenure has any effect on lwage

Significance test: H0: b3 = 0, H1: b3 ≠ 0 – two-tailed

Choosing the significance level alpha = 5%

```r
alpha <- 0.05
# printing the coefficients, standard errors and t-statistics
coef(output)
# "the" t-statistic: b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
# the 2.5% and 97.5% quantiles of the standardized normal distribution: rejection region in both tails
# Rule of thumb for two-tailed tests: reject the H0 if |t-statistic| ≥ 2
c(qnorm(alpha/2), qnorm(1-alpha/2))
```
[1] -1.959964 1.959964
The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.
The critical values are about \pm2.
Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small type I error of course.
We want to test whether tenure really has a positive effect on lwage

One-sided test: H0: b3 ≤ 0, H1: b3 > 0 – right-tailed

We choose a significance level alpha = 5%

```r
alpha <- 0.05
# the t-statistic: b3 / se(b3)
coef(output)[3,1] / coef(output)[3,2]
```
[1] 7.13307
```r
# the 95% quantile of the standardized normal distribution: rejection region only in the right tail
qnorm(1-alpha)
```
[1] 1.644854
The t-statistic is about 7, which means that the estimated coefficient for tenure is about seven times the standard deviation of the estimate away from 0.
The critical value is about 1.64.
Hence, the t-statistic clearly falls into the rejection region, thus the H_0 of no or a negative effect of tenure on lwage is overwhelmingly rejected. So, H_1 seems to be true, subject to a small type I error of course.
We want to test whether educ has an effect on lwage of at least 10%:
H0: b2 ≥ 0.1, H1: b2 < 0.1 – left-tailed (because estimated coefficient is less than 0.1)
Choosing the significance level alpha = 5%
```r
alpha <- 0.05
# printing the coefficients, standard errors and t-statistics
coef(output)
```
```r
# the t-statistic: (b2 - 0.1) / se(b2)
(coef(output)[2,1] - 0.1) / coef(output)[2,2]
```
[1] -1.087462
```r
# the 5% quantile of the standardized normal distribution: rejection region in the left tail
qnorm(alpha)
```
[1] -1.644854
The t-statistic does not fall into the rejection region.
Therefore the H_0 is not rejected – there is not enough evidence in favor of H_1
However, H_1 (b2 < 0.1) might be true anyway – failing to reject it would then be a type II error
3.4 Types of errors and power of statistical tests
There are principally two types of errors in statistical tests
Type I error or \alpha-error
This is the probability of falsely rejecting a true H_0 due to sampling errors. This probability is chosen via the significance level by the researcher. So, why not choose a very low significance level \alpha to minimize this error? In this case, rejecting the H_0 would give us more confidence regarding the rejection. But the downside is that a low significance level increases another type of error
Type II error or \beta-error
This is the probability of accepting a false H_0. Looking at Figure 3.3 and Figure 3.4 it is pretty clear that reducing the significance level, for instance from \alpha = 0.05 to 0.02, shrinks the rejection region and therefore enlarges the acceptance region. In this case, the probability of rejecting the H_0 in repeated samples is decreased even if the H_0 is false
The power of a test is the probability of rejecting a false H_0. Hence, this is one minus the probability of making a type II error and one of the most important and desirable properties of a test – it is the ability of the test to discriminate between true and false hypotheses
Besides the significance level and the standardized distance of the hypothesized value from the estimated one, all factors which reduce the standard errors of the estimates increase the power of a test; most notably a large number of observations, low multicollinearity and, relatedly, large variation in the explanatory variables
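A small simulation sketch (the model and all numbers are purely illustrative) shows how the power of the significance test H_0: \beta_1=0 increases with the sample size when the true slope is 0.2:

```r
set.seed(42)
power_sim <- function(n, beta1 = 0.2, n_rep = 2000, alpha = 0.05) {
  rejections <- replicate(n_rep, {
    x <- rnorm(n)
    y <- 1 + beta1 * x + rnorm(n)
    summary(lm(y ~ x))$coefficients[2, 4] < alpha   # p-value of beta1 below alpha?
  })
  mean(rejections)   # rejection frequency = estimated power
}
power_sim(n = 30)    # moderate power
power_sim(n = 200)   # much higher power
```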
3.5 p-value
The classical test strategy of
choosing a significance level
calculating the test statistic
looking whether the test statistic is within the rejection region
(i.e. whether |test statistic| ≥ |critical value|)
is somewhat arbitrary as the result of the test depends on the chosen significance level
As an example, suppose the observed t-statistic for a significance test, i.e. H_0: \beta_j=0 versus H_1: \beta_j≠0, equals 1.85
Using a significance level of 0.05, the H_0 cannot be rejected – the corresponding critical values, q_{0.025}, q_{0.975}, are about ±2
However, using a significance level of 0.1, the H_0 can be rejected – the corresponding critical values, q_{0.05}, q_{0.95}, are about ±1.65
So, what conclusions should be drawn?
The concept of the p-value provides an alternative to this and answers the following question: given the observed t-statistic, what is the smallest significance level at which the H_0 would be rejected?
This has the following interpretation:
The p-value is the probability of observing a t-statistic as large as the one we did (or even larger in absolute terms) if the H_0 is true
Hence, a small p-value is evidence against the H_0, as the actually observed (large) t-statistic is very unlikely to be the result of sampling errors alone in this case, and vice versa
Alternatively: A small p-value is evidence against the null hypothesis because one can reject the null hypothesis even at small significance levels (small error probability)
To calculate a p-value by hand is seldom necessary (the econometric package usually does the work). It could be done the following way, which, in a sense, is a reversal of the classical test strategy
Construct the rejection region, using ± the observed t-statistic as the critical values
Calculate the implied significance level. This is done with the cdf() (cumulative distribution function). For our example, the implied significance level is: \operatorname {cdf}(-1.85) + (1 - \operatorname {cdf}(1.85))=0.072.
The R code for an assumed t_{40} distribution is (1-\operatorname {pt}(\operatorname {abs}(1.85), 40)) \times 2=0.072
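In R, the calculation for this example boils down to one line:

```r
t_obs <- 1.85
(1 - pt(abs(t_obs), 40)) * 2   # two-tailed p-value, about 0.072
2 * pt(-abs(t_obs), 40)        # equivalent formulation
```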
Figure 3.5 below illustrates the concept of the p-value
```r
# Plot of prob. value
b = 0        # beta hat
sd = 1       # se(beta hat)
t = 1.85
df = 40
a = round((1 - pt(abs(t), df)) * 2, digits = 3)   # prob value
#a = 0.036  # significance level
beg = round(-4)
end = round(4)
maxh = dnorm(0)
x <- seq(beg, end, 0.01)
cvl <- -t
cvr <- t
plot(x = x, y = dt(x, df), type = "l", col = "black", lwd = 2,
     main = bquote("Observed t-statistic = 1.85," ~ t[40] ~ "distrib. => p-Value = [1 - pt(| t |, 40)]*2 = " ~ .(round(a, digits = 3))),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = beg:end, pos = c(0, 0))
axis(side = 2)
# critical region in left tail
polygon(x = c(beg, seq(beg, cvl, 0.01), cvl),
        y = c(0, dt(seq(beg, cvl, 0.01), df), 0), col = 'cornflowerblue')
# critical region in right tail
polygon(x = c(cvr, seq(cvr, end, 0.01), end),
        y = c(0, dt(seq(cvr, end, 0.01), df), 0), col = 'cornflowerblue')
# left
text(beg+1, maxh*0.3, labels = bquote(over(alpha, 2) ~ " = 1 - pt(| t |, 40) = " ~ .(a/2)), cex = 1)
arrows(beg+1, maxh*0.26, cvl*1.18, maxh*0.045, length = 0.08)
# right
text(end-1, maxh*0.3, labels = bquote(over(alpha, 2) ~ " = 1 - pt(| t |, 40) = " ~ .(a/2)), cex = 1)
arrows(end-1, maxh*0.26, cvr*1.18, maxh*0.045, length = 0.08)
# acceptance
text(end-1, maxh*0.85, labels = bquote("(1 -" ~ alpha ~ ") = " ~ .(1-a)), cex = 1)
arrows(end-1, maxh*0.81, end*0.1, maxh*0.6, length = 0.08)
# left
text(-0.8, maxh*0.35, labels = bquote(.(-t)), cex = 1)
arrows(-0.8, maxh*0.31, -t, maxh*0.003, length = 0.08)
# right
text(0.8, maxh*0.35, labels = bquote(.(t)), cex = 1)
arrows(0.8, maxh*0.31, t, maxh*0.003, length = 0.08)
text(-2.8, 0.3, labels = bquote("Distribution of the test statistic," ~ t[40]), cex = 1)
arrows(-2.8, 0.28, -1.12, 0.22, length = 0.08)
```
Figure 3.5: Distribution of the t-statistic
3.6 General remarks on parameter testing
Guidelines for discussing economic and statistical significance
All statistical tests described in this chapter require at least MLR.1 - MLR.5 (for small samples, we additionally need MLR.6). Violations of some assumptions need special treatments
If a variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its economic or practical importance
The fact that a coefficient is statistically significant does not necessarily mean it is economically or practically significant!
This is especially true for data sets with very many observations. For instance, if you have thousands of observations in a cross section data set, nearly every effect is statistically significant, even the most trivial ones
If a variable is statistically and economically important but has the “wrong” sign, the regression model might be misspecified
If a variable is statistically insignificant at the usual levels (10%, 5%, or 1%), one may think of dropping it from the regression
However, if the sample size is small, effects might be imprecisely estimated so that the case for dropping insignificant variables is less strong
3.7 Confidence intervals
Until now, we have dealt with point estimates of the unknown parameters. But according to Theorem 3.1, we know that
\hat \beta_j \ \sim \ N\left(\beta_j, \, sd(\hat \beta_j)^2\right)
Now, imagine we want to construct an interval estimate of the true parameter \beta_j which covers the true parameter with a probability of say, 95% (provided our model is correctly specified, of course)
To do so, we need the 0.025 and 0.975 quantiles of this distribution. It turns out that these quantiles are simply those of the standard normal distribution – or the t-distribution, if \sigma^2 is unknown and has to be estimated with only a few observations – times the estimated standard deviation of \hat \beta_j, se(\hat \beta_j)
This argument leads to the so-called 95% confidence interval, which covers the true parameter with a probability of 95%. As the distribution at hand is symmetric, we have q_{0.025}=-q_{0.975} and therefore
\hat \beta_j \ \pm \ q_{0.975} \times se(\hat \beta_j)
This way, we heuristically constructed an interval estimate for \beta_j, accounting for sample variability. Thereby, q_{0.975} denotes the 0.975 quantile of a standard normal or a t_{n-k-1} distribution (for fewer than 60 degrees of freedom)
As the 0.975 quantile is about 2 in most cases (except for very low degrees of freedom), we can approximate the 95% confidence interval by a very simple formula, which is often used in practice
For a 99% confidence interval, use 2.6, and for a 90% confidence interval use 1.65 instead of 2. These approximations work very well for more than 60 degrees of freedom
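The exact quantiles behind these rules of thumb are easily checked, for instance for 60 degrees of freedom:

```r
round(c(q_0.975 = qt(0.975, 60),   # about 2    -> 95% CI
        q_0.995 = qt(0.995, 60),   # about 2.66 -> 99% CI
        q_0.95  = qt(0.95, 60)),   # about 1.67 -> 90% CI
      3)
```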
For a more formal derivation of confidence intervals, we start with Theorem 3.2, which implies
P\left( \hat \beta_j - q_{0.975} \, se(\hat \beta_j) \ \le \ \beta_j \ \le \ \hat \beta_j + q_{0.975} \, se(\hat \beta_j) \right) \ = \ 0.95
A confidence interval is constructed in the way that in repeated samples it will cover the population regression coefficient (the true value) in 95% of the cases
Hence, the bounds of a confidence interval are random
We further have an important relationship between confidence intervals and hypothesis tests:
This means that if a specific value is not in the confidence interval, we can reject the null hypothesis in a two-tailed test that \beta_j is equal to this value
Hence, calculating confidence intervals is an alternative way to carry out parameter tests
Likewise, this equivalence to parameter tests implies that violations of assumptions MLR.1 – MLR.5 (MLR.6) may lead to invalid confidence intervals
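The following sketch illustrates this equivalence with the wage equation from the example above (the model object name mod is chosen here purely for illustration):

```r
library(AER); library(wooldridge); data("wage1")
mod <- lm(lwage ~ educ + tenure + exper, data = wage1)
ci <- confint(mod, level = 0.95)
ci["educ", ]   # 95% CI for the educ coefficient
# H0: beta_educ = 0.1 is rejected in a two-tailed 5% test
# exactly when 0.1 lies outside this interval
(ci["educ", 1] <= 0.1) & (0.1 <= ci["educ", 2])   # TRUE: 0.1 inside, H0 not rejected
```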
Simulating Confidence Intervals
To demonstrate the interpretation of confidence intervals from above, we simulate the following simple regression model
y = 3 + 1 \times x + u
with a sample size of n=100 and calculate the estimate for \beta_1 – with a true value of one – and the estimated corresponding 95% confidence interval with
\text{lower bound} \ = \ \hat \beta_1 - q_{0.975} \times se(\hat \beta_1), \qquad \text{upper bound} \ = \ \hat \beta_1 + \underbrace {q_{0.975}}_{1.984} \times se(\hat \beta_1)
This whole procedure is repeated 500 times. We plot the first 100 of the 500 calculated confidence intervals and check how many of them cover the true value \beta_1=1
Result: Of the 500 simulated intervals, 94.4% cover the true value of one. Of the first 100 plotted, 95 cover the true value and 5 do not. This is in nearly perfect accordance with expectations
```r
library(AER)   # we need coeftest()
set.seed(1)    # set seed
rep <- 500     # number of calculated confidence intervals
n <- 100       # sample size of regressions
nrep <- 100    # first nrep confidence intervals of the simulated rep, which are plotted
# initialize vectors of lower and upper interval boundaries
lower <- numeric(rep)
upper <- numeric(rep)
# loop of running regressions / estimation / CI
for (i in 1:rep) {
  # model
  x <- rnorm(n, mean = 5, sd = 5)
  u <- rnorm(n)
  y = 3 + x + u
  # regression
  ols <- lm(y ~ x)
  # CIs
  ctest <- coeftest(ols)
  lower[i] <- ctest[2,1] - qt(1-0.05/2, n) * ctest[2,2]
  upper[i] <- ctest[2,1] + qt(1-0.05/2, n) * ctest[2,2]
}
```
```r
# merge vectors of lower and upper confidence interval bounds into a matrix with two columns
CIs <- cbind(lower, upper)
# fraction of confidence intervals covering the true value of 1
print(paste(mean(CIs[, 1] <= 1 & 1 <= CIs[, 2]) * 100,
            "% of the simulated values are covered by the confidence interval"))
```
[1] "94.4 % of the simulated values are covered by the confidence interval"
```r
# identify intervals not covering the true value of 1 for the first nrep cases
ID <- which(!(CIs[1:nrep, 1] <= 1 & 1 <= CIs[1:nrep, 2]))
print(paste("The following CI does not cover the true value:", ID))
```
[1] "The following CI does not cover the true value: 5"
[2] "The following CI does not cover the true value: 14"
[3] "The following CI does not cover the true value: 23"
[4] "The following CI does not cover the true value: 27"
[5] "The following CI does not cover the true value: 66"
```r
# initialize the plot for the first 100 CIs
plot(0, xlim = c(1, nrep), ylim = c(1 - sd(CIs[,1]) * 4.3, 1 + sd(CIs[,2]) * 4.3),
     xlab = "Confidence intervals in repeated random samples",
     ylab = expression(beta[1] ~ "=" ~ 1),
     main = bquote("Simulating" ~ .(nrep) ~ "confidence intervals;" ~ .(length(ID)) ~ "out of" ~ .(nrep) ~ "do not cover the true value 1"))
# set up color vector
colors <- rep("grey60", nrep)   # color grey for all 100 CIs
colors[ID] <- "salmon"          # color salmon for CIs not covering the true value
# draw reference line at beta1 = 1
abline(h = 1)
# add the first nrep = 100 confidence intervals with corresponding x,y coordinates
for (j in 1:nrep) {
  lines(x = c(j, j), y = c(CIs[j, 1], CIs[j, 2]), col = colors[j], lwd = 3.6)
}
```
As we can see, 0.1 is contained in the 95% CI for educ. Therefore, a 10% effect of one more year of education cannot be ruled out at a significance level of 5%. This is the same result as we got earlier with the corresponding t-test
```r
# Calculating the 99% confidence intervals for all parameters
confint(out, level = 0.99)
```
3.8 Testing multiple linear restrictions – the F-test
Data set mlb1 from the Wooldridge package, n = 353 major league baseball players
The last three variables are measures of performance. We want to test whether the salaries of the players depend on these, given the number of years as a player and the number of games played per year
Our null hypothesis is H_0: \beta_3=0,\ \beta_4=0, \ \beta_5=0 against the H_1: at least one parameter ≠ 0
As we have to deal with multiple restrictions in this case, our usual t-test is no longer applicable
One possible general approach is as follows – Wald or F Tests
We first estimate the model without imposing the restrictions; this is the unrestricted version
After that, we estimate the model again, this time enforcing the restrictions; this is the restricted version (actually, this step is not really required for a Wald test, but it is easier to explain the test idea this way)
In the case of zero restrictions, like \beta_j=0, \, \ldots, (called exclusion restrictions), this is accomplished very easily by simply dropping the corresponding variables from the model
In the case of more complicated restrictions involving several parameters, like \beta_i - \beta_j=0 or \beta_i+\beta_j=1, a reparameterization of the model is necessary: for the first case we could define a new parameter \theta=\beta_i - \beta_j, substitute \beta_i=\theta+\beta_j into the model and collect the variables accordingly. Afterwards, imposing the restriction amounts to dropping the variable corresponding to \theta
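A hypothetical sketch of such a reparameterization (the data frame dat and the variables y, x1, x2 are placeholders, not objects from this chapter):

```r
# Restriction beta_1 - beta_2 = 0 in the model y = b0 + b1*x1 + b2*x2 + u.
# With theta = b1 - b2 and b1 = theta + b2 the model becomes
#   y = b0 + theta*x1 + b2*(x1 + x2) + u,
# so the restricted model (theta = 0) simply uses the constructed regressor x1 + x2:
unrestricted <- lm(y ~ x1 + x2, data = dat)
restricted   <- lm(y ~ I(x1 + x2), data = dat)
```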
Clearly, imposing restrictions on the model leads to a poorer fit. The test idea is that the H_0 has to be rejected if the fit of the model worsens too much. In this case, the data, along with the model of course, are apparently not compatible with the null hypothesis
Hence, we are looking for a test statistic which appropriately measures the decrease in the model’s fit caused by the imposition of the restrictions and from which we know the distribution
Thereby SSR \equiv \sum\nolimits _{i=1}^n \hat u_i^2 is defined for both the restricted and the unrestricted model, and the test statistic is
F \ = \ \dfrac{(SSR_r - SSR_{ur}) \ / \ m}{SSR_{ur} \ / \ (n-k-1)}
This statistic measures the percentage decrease in the fit (the percentage increase in SSR), corrected for the degrees of freedom of the numerator – m is the number of restrictions – and of the denominator – (n-k-1) are the degrees of freedom used in calculating \hat \sigma_{ur}^2
There is an equivalent version of this test statistic in terms of the R^2 instead of SSR. But this version is only applicable if the dependent variable is the same for both the restricted and the unrestricted model
F \ = \ \dfrac{(R_{ur}^2 - R_r^2) \ / \ m}{(1 - R_{ur}^2) \ / \ (n-k-1)}
This measures the decrease in the model’s fit in terms of a decrease in the R^2
Note that, for both versions, the numerator is always positive, as a (binding) restriction always reduces the model’s fit
One can show that the test statistic introduced above is F-distributed with m and (n-k-1) degrees of freedom, as both the numerator and the denominator are (scaled) \chi^2-distributed random variables, which is the definition of an F-distribution. This test statistic is simply called the F-statistic
Important Remark: The t-statistic introduced above is a special case of the F-statistic if we test only one restriction, i.e., m=1. Actually, in this case, the F-statistic is simply the square of the t-statistic. Therefore, any (two-tailed) t-test can also be formulated as an F-test
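This relationship can be checked directly with the wage equation from above; a small sketch (the model object name mod is assumed here):

```r
library(AER); library(wooldridge); data("wage1")
mod <- lm(lwage ~ educ + tenure + exper, data = wage1)

t_stat <- coef(summary(mod))["tenure", "t value"]
t_stat^2                 # squared t-statistic
lht(mod, "tenure = 0")   # F-test of the single restriction: F equals t_stat^2
```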
Wald statistic
As the denominator of the F-statistic is simply the estimate of \sigma_{ur}^2, which can be consistently estimated by OLS, there is an asymptotic version of the F-statistic, which we call W. Thereby, \hat \sigma_{ur}^2 in F is replaced by \sigma_{ur}^2 (assuming, the number of observations is large enough so that the estimate for \sigma_{ur}^2 has already converged to the true value) and F is further multiplied by m
W, which is called the Wald statistic, is \chi^2 distributed, as the denominator is no longer a random variable but simply a scaling constant, required so that W actually follows a \chi^2 distribution (with m degrees of freedom)
Many econometric software packages let the user choose whether the F or \chi^2 version of the test should be used
3.8.1 Carrying out the test
Carrying out an F-test is basically the same procedure as with a t-test
We postulate a null hypothesis
We determine the critical value as the (1-\alpha) quantile – e.g., the 0.95 quantile for \alpha=0.05 – of the corresponding F or \chi^2 distribution
We calculate the test statistic and check whether it falls into the rejection region, i.e., whether the test statistic is larger than the critical value
Alternatively, we can calculate the p-value corresponding to the test statistic
The figures below illustrate this procedure
Note, that the F and \chi^2 distribution are very similar in this case because the denominator’s degrees of freedom is quite high (347). The numerator’s degrees of freedom is 3 – we test for three restrictions;
Both test statistics correspond to the baseball player example presented above. Further note the different scaling of the two distributions which amounts to 3
As shown after the figures, the F-statistic and W-statistic for the baseball player example are 9.55 and 28.65, respectively. Is the H_0 rejected?
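Comparing the reported statistics with the corresponding critical values (or p-values) answers the question:

```r
qf(0.95, 3, 347)       # F critical value, about 2.6  =>  9.55 lies far in the rejection region
1 - pf(9.55, 3, 347)   # p-value of the F-statistic, essentially zero

qchisq(0.95, 3)        # chi-square critical value, about 7.81
1 - pchisq(28.65, 3)   # p-value of the Wald statistic, essentially zero
```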
```r
# Plot the F(3,347) distribution on interval [0,5]
df1 <- 3
df2 <- 353 - 6
end <- 5
a <- 0.05
hmax <- df(qf(0.3, df1, df2), df1, df2)   # not really hmax
t <- seq(0, 5, 0.01)
cv <- round(qf(1-a, df1, df2), digits = 2)
plot(x = t, y = df(t, df1, df2), type = "l", col = "black", lwd = 2,
     main = bquote("Test with F(" ~ .(df1) ~ ", " ~ .(df2) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = 0:end, pos = c(0, 0))
axis(side = 2)
# critical region in right tail
polygon(x = c(cv, seq(cv, 5, 0.01), 5),
        y = c(0, df(seq(cv, 5, 0.01), df1, df2), 0), col = 'cornflowerblue')
text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)), cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)
text(end*0.15, hmax/2.8, labels = bquote(qf[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08)
```
Figure 3.9: F-distribution
```r
# Plot the ChiSq(3) distribution on interval [0,15]
df <- 3
a <- 0.05
end <- df * 5
hmax <- dchisq(qchisq(0.3, df), df)   # not really hmax
t <- seq(0, end, 0.01)
cv <- round(qchisq(1-a, df), digits = 2)
plot(x = t, y = dchisq(t, df), type = "l", col = "black", lwd = 2,
     main = bquote("Test with ChiSq(" ~ .(df) ~ ") distribution and" ~ alpha ~ "=" ~ .(a)),
     ylab = "", xlab = "", cex.lab = 1, cex.main = 1.5, axes = FALSE)
axis(side = 1, at = 0:end, pos = c(0, 0))
axis(side = 2)
# critical region in right tail
polygon(x = c(cv, seq(cv, end, 0.01), end),
        y = c(0, dchisq(seq(cv, end, 0.01), df), 0), col = 'cornflowerblue')
text(end*0.75, hmax/2, labels = bquote(alpha ~ "=" ~ .(a)), cex = 1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length = 0.1)
text(end*0.15, hmax/2.8, labels = bquote(qchisq[0.95] ~ " = " ~ .(cv)), cex = 1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length = 0.08)
```
Figure 3.10: Chi-square distribution
3.8.2 Baseball example continued
We estimate the baseball player model; unrestricted version
The H_0 that none of the performance variables has any impact on the salary of the players is strongly rejected, although the variables individually are not significant – a multicollinearity problem
We carry out the same test with lht() (for linearHypothesis), contained in the AER package
Applying this procedure, we notice that it is actually not necessary to estimate the restricted model to carry out the test. We only need the unrestricted model, as was the case with the simple t-test
3.8.3 Test of overall significance of a regression
A special case of the F-test is a test of the overall significance of a regression with the
H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0
This null hypothesis states that the explanatory variables overall are not useful in explaining the dependent variable
This test of overall significance is routinely calculated and reported in most regression packages as “the” F-statistic; the null hypothesis is usually overwhelmingly rejected
If this H_0 cannot be rejected, the model is of no value in explaining the dependent variable
3.8.4 Additional examples
As a last example in this chapter we will re-estimate our wage equation
We will test whether the effect of one additional year of education is three times as large as the effect of one additional year of tenure; H_0: \beta_1 = 3 \beta_3
As this hypothesis involves only one restriction, a t-test could be used. However, the restriction affects two parameters, therefore, a reparameterization of the model with a variable transformation would be necessary to implement this restriction
The R function lht() (short for linearHypothesis()) from the car package, which is loaded by the AER package, implements this test very easily
Multiplying out we see that, in this case, we test whether \hat \beta_2=0 and \hat \beta_3=0.
We have the following important Theorem:
Theorem 3.3 (Distribution of a quadratic form) Let \mathbf z be a N(\mathbf 0, \mathbf V) normally distributed random vector with h elements. Then the quadratic form
\mathbf z' \mathbf V^{-1} \mathbf z
is distributed as
\mathbf z' \mathbf V^{-1} \mathbf z \ \sim \ \chi^2(h)
(see Appendix A.6 and A.9).
Assuming that \sigma^2 is a known constant and applying the covariance formula from Equation C.7 to the restriction vector (\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c), we get the covariance matrix of this random vector:
\operatorname{Var}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c) \ = \ \sigma^{2} \, \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'
As \boldsymbol{\hat \beta} is asymptotically normally distributed (or normally distributed in small samples if \mathbf u is normally distributed), provided H_0 is true, the restriction vector is
(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c) \ \sim \ N \left( \mathbf 0, \ \sigma^{2} \, \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R' \right)
Applying Theorem 3.3 to this restriction vector yields the Wald statistic
W \ = \ (\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)' \left(\sigma^{2} \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'\right)^{-1}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c) \ \sim \ \chi^2(m) \tag{3.13}
This Wald test statistic is a weighted sum (weighted by the inverse of the covariance matrix) of the distances between the hypothesized values \mathbf c and the actual estimation outcomes \mathbf R \boldsymbol{\hat \beta}. If this normed distance is too large, H_0 is rejected.
For m = 1 (only one restriction), \sqrt W reduces to a simple t-statistic.
If \sigma^2 is unknown, it has to be estimated by \hat \sigma^2={\sum_{i=1}^n \hat u_i^2} \ / \ {(n-k-1)}.
With the help of the above theorem, it can be shown that this estimator (after scaling with \sigma^2) is distributed as \chi^2(n-k-1)
Hence, in this case, the Wald statistic in Equation 3.13 is the ratio of two independently \chi^2-distributed random variables and is therefore F-distributed after appropriate scaling (see Appendix A.7):
F \ = \ \frac{(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)' \left(\hat\sigma^{2} \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'\right)^{-1}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)}{m} \ \sim \ F(m,n-k-1) \tag{3.14}
Note, in small samples this formula is valid only if \mathbf u is normally distributed. In large samples, Equation 3.14 times m converges to Equation 3.13 and so the two test statistics are asymptotically equivalent.
Furthermore, it can be shown that the formula above is equivalent to the SSR version of the F-statistic:
F \ = \ \dfrac{(SSR_r - SSR_{ur}) \ / \ m}{SSR_{ur} \ / \ (n-k-1)}
The interpretation of this test-statistic is straightforward: If the percentage decrease of the fit (measured by SSR) of the restricted model is too large compared to the model without imposing the restrictions, the restricted model is rejected.
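A sketch of this computation in R (the model objects unrestricted_model and restricted_model are placeholders for the two estimated versions, e.g. the baseball models above):

```r
ssr_ur <- sum(resid(unrestricted_model)^2)   # SSR of the unrestricted model
ssr_r  <- sum(resid(restricted_model)^2)     # SSR of the restricted model
m      <- 3                                  # number of restrictions (as in the example)
df_ur  <- df.residual(unrestricted_model)    # n - k - 1

F_stat <- ((ssr_r - ssr_ur) / m) / (ssr_ur / df_ur)
1 - pf(F_stat, m, df_ur)                     # p-value of the F-test
```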
Wooldridge, Jeffrey M. 2019. Introductory Econometrics: A Modern Approach. Seventh ed. Boston: Cengage.
---title-block-banner: truesubtitle: "Mainly based on @wooldridge_Intro_Econometrics, Chapter 4"---# Inference\pagebreak## Topics of this chapterThis chapter deals with statistical inference in the regression model, particularly with- **Hypothesis tests** about population parameters- Construction of **confidence intervals**For both we need the **distribution** of the OLS estimators for repeated samples- The OLS estimators $\hat \beta_j$ and $\hat \sigma^2$ depend on the particular sample and are therefore **random variables**- We already know their *expected values* and their *variances*- However, for hypothesis tests we need to know their *distribution*- In order to derive their distribution we need *additional* assumptions, at least for *small* samples - MLR.6: Errors are normally distributed $$ u_i \ \sim \ N(0, \sigma^2) $$ which implies that the depended variable $y$ is normally distributed as well, conditional on the values of the explanatory variables in $\mathbf x$$$y_i \, | \, \mathbf x \ \sim \ N(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \, \sigma^2) $$------------------------------------------------------------------------## Statistical inference in generalDiscussion of the normality assumption- The error term $u_i$ is interpreted as the sum of **many different unobserved factors**- According to the *Central Limit Theorem* (**CLT**, see @thm-CLT), sums (means) of many independent factors tends to be normally distributed- Problems: - Are the conditions for the CLT met? In particular: - How many different factors? Is the number large enough? - Possibly very heterogeneous distributions of individual factors - How independent are the different factors from each other?The normality of the error term is an *empirical question*- At least the error distribution should be "close” to *normal*- In many cases, normality is questionable or even impossible *by definition* - Examples where normality cannot strictly hold: - Wages (non-negative; also: minimum wage) - Number of arrests (takes on a small number of integer values) - Employed/not employed (indicator, respectively, binary dependent variable, takes on only 1 or 0)- In some cases, normality can be achieved through transformations of the dependent variable (e.g. use log(wage) instead of wage)> **Very important**: Under certain regularity conditions regarding the data and $u_i$ the assumption of **normality** can be *replaced* by a **large sample size**. 
In that case the error term needs not to be normally distributed.------------------------------------------------------------------------The following terminology is often used:$$\underbrace{\text{MLR.1 - MLR.5}}_{\text{Gauss-Markov assumptions}} \quad \quad \underbrace{\text{MLR.1 - MLR.6}}_{\text{Classical linear model (CLM) assumptions}}$$------------------------------------------------------------------------Furthermore, we have the following important theorem:::: {#thm-normal}## Normal sampling distributionsUnder assumptions MLR.1 - MLR.6$$\hat \beta_j \ \sim \ N \left( \beta_j, \operatorname{Var}(\hat \beta_j) \right)$$ {#eq-betahat_normal}That means that, **given the assumptions above**, the OLS estimator $\hat \beta_j$ is *unbiased* and *normally distributed* around the *true* parameter with a variance that was derived earlier.Additionally:$$\dfrac {\hat \beta_j - \beta_j}{sd(\hat \beta_j)} \ \sim \ N(0,1) $$ {#eq-standard_beta_normal}This means that the *standardized* estimator follows a *standard* normal distribution:::------------------------------------------------------------------------::: {.callout-tip collapse="true" icon="false"}## Proof of @thm-normalThe proof is quite straightforward if we use matrix notation, see @sec-matrixThe model in matrix notation is:$$\mathbf y = \mathbf X \boldsymbol \beta + \mathbf u $$ And the OLS estimator for $\boldsymbol{\hat \beta}$ is (see @eq-C_betahat):$$\hat {\boldsymbol \beta} = (\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf y $$ This can be written as (@eq-C_betahat_alternatine):$$\hat {\boldsymbol \beta} = \boldsymbol \beta + \underbrace {(\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf u}_{\mathbf {Au}} \tag{3.9}$$- The typical element of the vector $\mathbf {Au}$ is $\sum_{j=1}^n a_{ij} \, u_j$ - Hence, the OLS estimator $\hat {\boldsymbol \beta}$ is a linear combination of the normally distributed errors $u_{j}$ - According to the **Reproductive Property** of the normal distribution, $\hat {\boldsymbol \beta}$ therefore is normally distributed as well:$$\hat {\boldsymbol \beta} \, | \, \mathbf X \ \sim \ N \left( \boldsymbol \beta, \ \sigma^2 (\mathbf X' \mathbf X)^{-1} \right) $$- Here, we used the fact that under MLR.1 - MRL.4, $\hat {\boldsymbol \beta}$ is unbiased- For the derivation of the covariance matrix of $\hat {\boldsymbol \beta}$ we used MLR.5 and formula @eq-C_covmatrix_lin_combination:::------------------------------------------------------------------------## Testing hypotheses about a single population parameterWe have the following theorem:::: {#thm-t_distribution}## t-distribution for the standardized estimatorsGiven MLR.1 - MLR.6, we state:$$\dfrac {\hat \beta_j - \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1} $$ {#eq-standbeta_t_distribution}:::- That means that if the standardization in @eq-standard_beta_normal is done using the **estimated** *standard deviation* of $\hat \beta_j$, the normal distribution from @eq-standard_beta_normal is replaced by a t-distribution - The **estimated** *standard deviation* of $\hat \beta_j$ is called **standard error** - The standard error, se($\hat \beta_j$), is calculated with $\hat \sigma^2$ instead of the true $\sigma^2$ in @eq-se_beta_hat_hat. Therefore, se($\hat \beta_j$) is itself a random variable which is usually $\chi^2$ distributed- In this case, we actually have a **ratio** of **two** random variables in @eq-standbeta_t_distribution -- a normal over a $\chi^2$ distributed random variable, which usually *leads to a t-distribution*. 
That's our heuristic **proof** of @thm-t_distribution- A very **important special case** arises if we test the particular null hypothesis $H_0 : \beta_j=0$ - This means that we *hypothesize* that the **true** population parameter is **equal to zero**, i.e. **there is no effect of** $x_j$ **on** $y$ *after controlling for the other independent variables* - This is the so called **Significance Test**. In this case, presuming the $H_0$ is true, @eq-standbeta_t_distribution reduces to `the` **t-statistic**::: {.callout-important appearance="simple" icon="false"}$$t_{\hat \beta_j} \ \equiv \ \dfrac {\hat \beta_j - 0}{se(\hat \beta_j)} \ = \ \dfrac {\hat \beta_j}{se(\hat \beta_j)} \ \sim \ t_{n-k-1}$$ {#eq-the_t-stat}:::- If we, more generally, hypothesize that $\beta_j = a_j$, the test statistic is (*presuming the* $H_0$ *is true*)$$\dfrac {\hat \beta_j-a_j} {se(\hat \beta_j)} \ \sim \ t_{n-k-1} $$ {#eq-eq-t-stat_gen}> **Generally, the t-statistic is a standardized measure of the distance between the estimated parameter and its hypothesized value** (which is zero or $a_j$)> In particular, it measures **how many estimated standard deviations the estimated coefficient is away from the** $H_0$ (0 or $a_j$)- The **basic test idea** is the following: The farther the estimated coefficient is away from the hypothesized value zero (or $a_j$), the less likely it is that the null hypothesis holds true - But what does “far away" from zero (or $a_j$) mean? This is *measured* by the t-statistic above and *assessed* with the help of the distribution of the t-statistic - If the $H_0$ is true, we know from @thm-t_distribution that the distribution of the t-statistic is the t-distribution (so, the name of the t-statistic is not accidentally) - If the degrees of freedom of the t-distribution are greater than 60 (many observations), the t-distribution is practical identical to a standardized $N(0,1)$ normal distribution------------------------------------------------------------------------### Testing procedure for a single parameterHow is a test procedure actually carried out?1. From @thm-t_distribution, we know the distribution of the test statistic2. We carry out the estimation of $\beta_j$ and se($\beta_j$) and calculate the test statistic @eq-eq-t-stat_gen3. We want to test a hypothesis $H_0$ about the true value of the estimated parameter -- actually we want to reject the $H_0$4. The test statistic shows how far away the estimate $\hat \beta_j$ is from the hypothesized value5. To asses whether the test statistic is far away enough to reject the $H_0$ we construct a **rejection region** so that, if the *null hypothesis is* **true**, the test statistic only *accidentally* falls into this rejection region, for instance in only 5% of the cases. This small probability is freely chosen and called **significance level** i. The *significance level* $\alpha$ is the probability of making a so called **type I error** or $\alpha$**-error**, which is the probability to **falsely reject a true hypothesis accidentally** (only due to sampling errors). In most cases $\alpha$=0.05 (5%) is chosen, but $\alpha$=0.01 (1%) or $\alpha$=0.1 (10%) are also quite common ii. The *boundary points* of the rejection region are called **critical values**. These are the values which are exceeded (deceeded) in x% of the cases. Formally, we are looking for the corresponding quantile of the test statistic's distribution6. 
Having constructed the rejection region of the test, we *reject* the null hypothesis $H_0: \beta_j=a_j$ in favor of the *alternative hypothesis* if the *test statistic falls into this rejection region*. In this case the estimated value is “*too far away*” from the hypothesized value7. There are *two types* of alternative hypothesis $H_1$: i. $H_1: \beta_j \neq a_j$ for a *two sided* or *two-tailed* test ii. either $H_1: \beta_j>a_j$ or $H_1: \beta_j<a_j$ for a *one sided* or *one-tailed* test iii. For one-sided tests, it is sometimes not obvious, whether one could apply a left- or right sided one. There is a simple rule: If the t-statistic is negative, we can only apply a left-sided test, and if the t-statistic is positive, we can only apply a right-sided test. Otherwise, the $H_0$ could never be rejected8. Two-tailed tests are much more common. One-tailed tests are used if the researcher is only interested in a particular "direction" of the parameter, e.g., with a unit-root test (a particular test for the stationarity of time series) one is only keen on whether the parameter of interest might be $\geq 0$------------------------------------------------------------------------```{r, fig.align='center', fig.height=5.5, fig.width=9}#| code-fold: true#| label: fig-one_tailed#| fig-cap: "Distribution of the t-statistic - in this case a normal distribution"# Plot the normal distribution on interval [-5,5]t <-seq(-5, 5, 0.01)cvl <-round( qnorm(0.95), digits =2) # critical value as quantile of the norm. distrib.plot(x = t, y =dnorm(t), type ="l", col ="black", lwd =2, main =expression("One-tailed test with normal distrib. and"~ alpha ~"= 0.05; e.g., H"[0]:~ beta ~"≤ 0, H"[1]:~ beta ~" > 0"),ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)axis(side=1, at=-5:5, pos =c(0,0)) axis(side=2) # critical region in right tailpolygon(x =c(cvl, seq(cvl, 5, 0.01), 5),y =c( 0, dnorm(seq(cvl, 5, 0.01)), 0 ), col ='cornflowerblue')# texttext(3.5, 0.15, labels =expression( "Rejection region,"~ alpha ~"= 0.05"),cex =1)arrows(3.6, 0.133, 2.3, 0.04, length =0.08)text(0, 0.1, labels =bquote( qn[0.95] ~" = "~ .(cvl)), cex =1)arrows(0, 0.085, cvl-0.03, 0.002, length =0.08)text(-3.5, 0.3, labels =expression("Distribution of the test statistic"),cex =1)arrows(-3.6, 0.28, -1.22, 0.2, length =0.08)```------------------------------------------------------------------------```{r, fig.align='center', fig.height=5.5, fig.width=9}#| code-fold: true#| label: fig-two_tailed#| fig-cap: "Distribution of the t-statistic - in this case a normal distribution"# Plot the normal distribution on interval [-5,5]t <-seq(-5, 5, 0.01)# lower and upper critical valuescvl <-round( qnorm(0.025), digits =2 ); cvr <-round( qnorm(0.975), digits =2)plot(x = t, y =dnorm(t), type ="l", col ="black", lwd =2, main =expression("Two-tailed test with normal distrib. 
and"~ alpha ~"= 0.05; e.g., H"[0]:~ beta ~"= 0, H"[1]:~ beta ~" ≠ 0"),ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)axis(side=1, at=-5:5, pos =c(0,0)) axis(side=2) # critical region in left tailpolygon(x =c(-5, seq(-5, cvl, 0.01), cvl),y =c( 0, dnorm(seq(-5, cvl, 0.01)), 0 ), col ='cornflowerblue')# critical region in right tailpolygon(x =c(cvr, seq(cvr, 5, 0.01), 5),y =c( 0, dnorm(seq(cvr, 5, 0.01)), 0 ), col ='cornflowerblue')text(-4, 0.15, labels =expression("0.025"~"="~over(alpha, 2)),cex =1)arrows(-4, 0.13, -2.5, 0.026, length =0.08)text(4, 0.15, labels =expression("0.025"~"="~over(alpha, 2)),cex =1)arrows(4, 0.13, 2.5, 0.026, length =0.08)text(-0.5, 0.15, labels =bquote( qn[0.025] ~" = "~ .(cvl)), cex =1)arrows(-0.5, 0.135, cvl+0.03, 0.002, length =0.08)text(0.5, 0.1, labels =bquote( qn[0.975] ~" = "~ .(cvr)), cex =1)arrows(0.5, 0.085, cvr-0.03, 0.002, length =0.08) ```------------------------------------------------------------------------```{r, fig.align='center', fig.height=5.5, fig.width=9}#| code-fold: true#| label: fig-two_tailed1#| fig-cap: "Distribution of the t-statistic - in this case a t-distribution"# Plot the t(4) Distribution on interval [-5,5]t <-seq(-5, 5, 0.01)# lower and upper critical valuescvl <-round( qt(0.025, 4), digits =2 ); cvr <-round( qt(0.975, 4), digits =2)plot(x = t, y =dt(t, 4), type ="l", col ="black", lwd =2, main =expression("Two-tailed test with t(4) distribution and"~ alpha ~"= 0.05; e.g., H"[0]:~ beta ~"= 0, H"[1]:~ beta ~" ≠ 0"),ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)axis(side=1, at=-5:5, pos =c(0,0)) axis(side=2) # critical region in left tailpolygon(x =c(-5, seq(-5, cvl, 0.01), cvl),y =c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), col ='cornflowerblue')# critical region in right tailpolygon(x =c(cvr, seq(cvr, 5, 0.01), 5),y =c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), col ='cornflowerblue')text(-4.5, 0.15, labels =expression("0.025"~"="~over(alpha, 2)),cex =1)arrows(-4.5, 0.13, -3.8, 0.02, length =0.08)text(4.5, 0.15, labels =expression("0.025"~"="~over(alpha, 2)),cex =1)arrows(4.5, 0.13, 3.8, 0.02, length =0.08)text(-2.2, 0.18, labels =bquote( qt[0.025] ~" = "~ .(cvl)), cex =1)arrows(-2.2, 0.165, cvl+0.03, 0.0015, length =0.08)text(0.55, 0.09, labels =bquote( qt[0.975] ~" = "~ .(cvr)), cex =1)arrows(0.65, 0.075, cvr-0.03, 0.002, length =0.08) ``````{r, fig.align='center', fig.height=5.5, fig.width=9}#| code-fold: true#| label: fig-two_tailed2#| fig-cap: "Distribution of the t-statistic - in this case a t-distribution "# Plot the t(3) Distribution on interval [-5,5]t <-seq(-5, 5, 0.01)# lower and upper critical valuescvl <-round( qt(0.01, 4), digits =2 ); cvr <-round( qt(0.99, 4), digits =2)plot(x = t, y =dt(t, 4), type ="l", col ="black", lwd =2, main =expression("Two-tailed test with t(4) distribution and"~ alpha ~"= 0.02; e.g., H"[0]:~ beta ~"= 0, H"[1]:~ beta ~" ≠ 0"),ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)axis(side=1, at=-5:5, pos =c(0,0)) axis(side=2) # critical region in left tailpolygon(x =c(-5, seq(-5, cvl, 0.01), cvl),y =c( 0, dt(seq(-5, cvl, 0.01), 4), 0 ), col ='cornflowerblue')# critical region in right tailpolygon(x =c(cvr, seq(cvr, 5, 0.01), 5),y =c( 0, dt(seq(cvr, 5, 0.01), 4), 0 ), col ='cornflowerblue')text(-4.5, 0.15, labels =expression("0.01"~"="~over(alpha, 2)),cex =1)arrows(-4.5, 0.13, -4.1, 0.015, length =0.08)text(4.5, 0.15, labels =expression("0.01"~"="~over(alpha, 2)),cex =1)arrows(4.5, 0.13, 4.1, 0.015, length =0.08)text(-2.2, 0.18, labels =bquote( qt[0.01] ~" = "~ .(cvl)), cex 
=1)arrows(-2.2, 0.165, cvl+0.03, 0.0015, length =0.08)text(0.55, 0.09, labels =bquote( qt[0.99] ~" = "~ .(cvr)), cex =1)arrows(0.65, 0.075, cvr-0.03, 0.002, length =0.08) ```------------------------------------------------------------------------### Example for tests - wage equation##### **We first estimate the model -- in this case for explaining the received wage of persons**```{r, message=FALSE}#| comment: " "library(AER); library(wooldridge); data("wage1")output <-summary( lm(lwage ~ educ + tenure + exper, data = wage1) )output```------------------------------------------------------------------------##### **We want to test whether `tenure` has any effect on lwage**- Significance Test: H0: b3 = 0, H1: b3 ≠ 0, **two-tailed**- Choosing the significance level alpha = 5%```{r}#| comment: " "alpha <-0.05# printing the coefficients, standard errors and t-statisticcoef(output)# "the" t-statistic: b3 / se(b3)coef(output)[3,1] /coef(output)[3,2]# the 2.5% and 97.5% quantile of the standardized normal distribution: rejection region in both tails# Rule of Thumb for two-tailed tests: reject the H0 if |t-statistic| ≥ 2c( qnorm(alpha/2), qnorm(1-alpha/2) )```The t-statistic is about 7, which means that the estimated coefficient for `tenure` is about seven times the standard deviation of the estimate away from 0.The critical values are about $\pm2$.Hence, the t-statistic clearly falls into the rejection region, thus the $H_0$ of no effect of `tenure` on `lwage` is overwhelming rejected. So, $H_1$ seems to be true, subject to a small type I error of course.------------------------------------------------------------------------##### **We want to test whether `tenure` has really a positive effect on lwage**- One sided test: H0: b3 ≤ 0, H1: b3 \> 0 -- **right-tailed**- We choose a significance level alpha = 5%```{r}#| comment: " "alpha <-0.05# printing the coefficients, standard errors and t-statisticcoef(output)# the t-statistic: b3 / se(b3)coef(output)[3,1] /coef(output)[3,2]# the 95% quantil of the standardized normal distribution: rejection region only in the right tailqnorm(1-alpha)```The t-statistic is about 7, which means that the estimated coefficient for `tenure` is about seven times the standard deviations of the estimate away from 0.The critical values is about 1.64.Hence, the t-statistic clearly falls into the rejection region, thus the $H_0$ of no or *a negative* effect of `tenure` on `lwage` is overwhelming rejected. 
So, $H_1$ seems to be true, subject to a small type I error of course.------------------------------------------------------------------------##### **We want to test whether `educ` has an effect on lwage of at least 10%:**- H0: b2 ≥ 0.1, H1: b2 \< 0.1 -- **left-tailed** (because estimated coefficient is less than 0.1)- Choosing the significance level alpha = 5%```{r}#| comment: " "alpha <-0.05#| comment: " "# printing the coefficients, standard errors and t-statisticcoef(output)# the t-statistic: (b2 - 0.1) / sd(b2) ( coef(output)[2,1] -0.1 ) /coef(output)[2,2]# the 5% quantil of the standardized normal distribution: rejection region in the left tailqnorm(alpha)```The t-statistic does not fall into the rejection region.Therefore $H_0$ is not rejected - so not enough evidence to be sure of $H_1$However, b2 \< 0.1, $H_1$, might be true anyway -- type II error------------------------------------------------------------------------## Types of errors and power of statistical tests {#sec-typesoferrors}- There are principally *two* types of errors in statistical tests - **Type I error** or $\alpha$-error - This is the probability to falsely *reject* a *true* $H_0$ due to sampling errors. This probability is chosen via the **significance level** by the researcher. So, why not choosing a very low significance level $\alpha$ to minimize this error? In this case, rejecting the $H_0$ would give us more confidence regarding the rejection. But the *downside* is that a low significance level increases another type of error - **Type II error** or $\beta$-error - This is the probability of *accepting* a *false* $H_0$. Looking at @fig-two_tailed1 and @fig-two_tailed2 it is pretty clear that reducing the significance level, for instance from $\alpha=$ 0.05 to 0.02, shrinks the rejection region and therefore enlarges the acceptance region. In this case, the probability of rejecting the $H_0$ in repeated samples is decreased even if the $H_0$ is false - The **power of a test** is the probability to reject a false $H_0$. Hence, this is *one minus the probability of making a type II error* and one of the most important and desirable properties of a test -- it is the ability of the test to discriminate between true and false hypothesis - Besides the *significance level* and the *standardized distance of the hypothesized value from the estimated one*, all factors which *reduce* the standard errors of the estimates *increase* the power of a test; most notably a *large number of observations* and a *low multicollinearity* respectively, *large variations of the explanatory variables*------------------------------------------------------------------------## p-valueThe *classical test strategy* of1) choosing a *significance level*2) calculating the test statistic3) looking whether the *test statistic is within the rejection region*\ (i.e. whether \|test statistic\| ≥ \|critical value\|)is somewhat arbitrary as the result of the test depends on the chosen significance level- As an **example**, suppose the observed *t-statistic* for a significance test, i.e. 
------------------------------------------------------------------------

## p-value

The *classical test strategy* of

1)  choosing a *significance level*
2)  calculating the test statistic
3)  looking whether the *test statistic is within the rejection region*\
    (i.e. whether \|test statistic\| ≥ \|critical value\|)

is somewhat arbitrary, as the result of the test depends on the chosen significance level

- As an **example**, suppose the observed *t-statistic* for a significance test, i.e. $H_0: \beta_j=0$ versus $H_1: \beta_j≠0$, equals 1.85
  - Using a significance level of 0.05, the $H_0$ cannot be rejected -- the corresponding critical values, $q_{0.025}, q_{0.975}$, are about ±2
  - However, using a significance level of 0.1, the $H_0$ can be rejected -- the corresponding critical values, $q_{0.05}, q_{0.95}$, are about ±1.65
  - So, what conclusions should be drawn?

------------------------------------------------------------------------

- The concept of the **p-value** provides an *alternative* to this and answers the following question: *given the observed t-statistic, what is the* ***smallest*** *significance level at which the* $H_0$ would be ***rejected?***
- This has the following interpretation:
  - *The p-value is the probability of observing a t-statistic like the one we did (or an even larger one in absolute terms) if the* $H_0$ *is true*
  - Hence, a *small* p-value is evidence *against* the $H_0$, as the actually observed (large) t-statistic is then very unlikely to be merely the result of sampling errors, and vice versa
  - Alternatively: A *small* p-value is evidence *against* the null hypothesis because one can reject the null hypothesis even at small significance levels (small error probability)
- Calculating a p-value by hand is seldom necessary (the econometric package usually does the work). It could be done in the following way, which, in a sense, is a reversal of the classical test strategy
  - Construct the rejection region, using ± the observed *t-statistic as critical values*
  - Calculate the *implied significance level*. This is done with the **cdf()** (cumulative distribution function). For our example, the implied significance level is: $\operatorname {cdf}(-1.85) + (1 - \operatorname {cdf}(1.85))=0.072$.\
    The R code for an assumed $t_{40}$ distribution is $[1-\operatorname{pt}(\operatorname {abs}(1.85), 40)] \times 2=0.072$
- The following @fig-pvalue illustrates the concept of the p-value
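As a quick check of this calculation (an addition to the text), the same number can be computed directly in R:

```{r}
#| comment: " "
# own illustration: two-tailed p-value of the observed t-statistic 1.85 under a t(40) distribution
( 1 - pt(abs(1.85), 40) ) * 2
```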
------------------------------------------------------------------------

```{r, fig.align='center', fig.height=5.5, fig.width=9}
#| code-fold: true
#| label: fig-pvalue
#| fig-cap: "Distribution of the t-statistic"

# Plot of prob.value
b=0       # beta hat
sd=1      # se(beta hat)
t=1.85
df=40
a=round( (1-pt(abs(t), df))*2, digits =3 ) # prob value
#a=0.036 # significance level
#
beg=round(-4)
end=round(4)
maxh=dnorm(0)
x <- seq(beg, end, 0.01)
cvl <- -t
cvr <- t

plot(x = x, y =dt(x, df), type ="l", col ="black", lwd =2, 
     main =bquote( "Observed t-statistic = 1.85,"~ t[40] ~"distrib. => p-Value = [1 - pt(| t |, 40)]*2 = "~ .(round( a, digits=3))),
     ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)
axis(side=1, at=beg:end, pos =c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x =c(beg, seq(beg, cvl, 0.01), cvl),
        y =c( 0, dt(seq(beg, cvl, 0.01), df), 0 ),
        col ='cornflowerblue')
# critical region in right tail
polygon(x =c(cvr, seq(cvr, end, 0.01), end),
        y =c( 0, dt(seq(cvr, end, 0.01), df), 0 ),
        col ='cornflowerblue')

# left
text(beg+1, maxh*0.3, labels =bquote( over(alpha,2) ~" = 1 - pt(| t |, 40) = "~ .(a/2) ), cex =1)
arrows(beg+1, maxh*0.26, cvl*1.18 , maxh*0.045, length =0.08)
# right
text(end-1, maxh*0.3, labels =bquote( over(alpha,2) ~" = 1 - pt(| t |, 40) = "~ .(a/2) ), cex =1)
arrows(end-1, maxh*0.26, cvr*1.18, maxh*0.045, length =0.08)
# acceptance 
text(end-1, maxh*0.85, labels =bquote( "(1 -"~ alpha ~") = "~ .(1-a) ), cex =1)
arrows(end-1, maxh*0.81, end*0.1, maxh*0.6, length =0.08)
# left
text( -0.8, maxh*0.35, labels =bquote( .(-t)), cex =1)
arrows(-0.8, maxh*0.31, -t, maxh*0.003, length =0.08)
# right
text(0.8, maxh*0.35, labels =bquote( .(t)), cex =1)
arrows(0.8, maxh*0.31, t, maxh*0.003, length =0.08)

text(-2.8, 0.3, labels =bquote("Distribution of the test statistic,"~ t[40]),cex =1)
arrows(-2.8, 0.28, -1.12, 0.22, length =0.08)
```

------------------------------------------------------------------------

## General remarks on parameter testing

**Guidelines for discussing economic and statistical significance**

- All statistical tests described in this chapter require at least MLR.1 - MLR.5 (for small samples, we additionally need MLR.6). *Violations of some assumptions need special treatments*
- If a variable is statistically significant, discuss the *magnitude* of the coefficient to get an idea of its *economic or practical importance*
- The fact that a coefficient is statistically significant does not necessarily mean it is economically or practically significant!
  - This is especially true for data sets with very many observations. For instance, if you have thousands of observations in a cross section data set, *nearly every effect is statistically significant, even the most trivial ones*
- If a variable is statistically *and* economically important but has the "wrong" sign, the regression model might be misspecified
- If a variable is statistically insignificant at the usual levels (10%, 5%, or 1%), one may think of dropping it from the regression
  - However, if the sample size is small, effects might be imprecisely estimated, so that the case for dropping insignificant variables is less strong

------------------------------------------------------------------------

## Confidence intervals

Until now, we have dealt with *point estimates* of the unknown parameters. But according to @thm-normal, we know that

$$ {\hat \beta_j} \ \sim \ N\left(\beta_j, \, sd(\hat \beta_j)^2\right) $$

Now, imagine we want to construct an *interval estimate* of the true parameter $\beta_j$ which covers the true parameter with a probability of, say, 95% (provided our model is correctly specified, of course)

- To do so, we need the 0.025 and 0.975 quantiles of this distribution. It turns out that these quantiles are simply the ones of the standardized normal distribution -- or of the t distribution, if $\sigma^2$ is unknown and has to be estimated with only a few observations -- *times the estimated standard deviation of* $\hat \beta_j$, $se(\hat \beta_j)$
- This argument leads to the so-called **95% Confidence Interval**, which *covers the true parameter with a probability of 95%*.
As the distribution at hand is symmetric, we have $q_{0.025}=-q_{0.975}$ and therefore

$$\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ q_{0.975} \times se(\hat \beta_j) \right]$$ {#eq-confinterval}

This way, we heuristically constructed an *interval estimate* for $\beta_j$, *considering sample variability*. Thereby, $q_{0.975}$ denotes the 0.975 quantile of a standardized normal or a $t_{n-k-1}$ distribution (for fewer than 60 degrees of freedom)

------------------------------------------------------------------------

- As the 0.975 quantile is about 2 in most cases (except for very low degrees of freedom), we can **approximate** the 95% confidence interval by a very **simple formula**, which is often used in practice

::: {.callout-important appearance="simple" icon="false"}
$$\beta_j \ \ \underset {95\%} \in \ \ \left[ \hat \beta_j \ ± \ 2 \times se(\hat \beta_j) \right]$$ {#eq-confinterval1}
:::

- For a 99% confidence interval, use 2.6, and for a 90% confidence interval use 1.65 instead of 2. These approximations work very well for more than 60 degrees of freedom
- For a more formal derivation of confidence intervals, we start with @thm-t_distribution which implies:

$$P\left(-q_{0.975} \ < \ \dfrac{\hat{\beta}_{j}- \beta_j }{ se(\hat \beta_j)} \ < \ q_{0.975}\right)=0.95 \ \ \Rightarrow\left[\times se(\hat \beta_j)\right]\\
P\left(-q_{0.975} \times se(\hat \beta_j) \ < \ \hat{\beta}_{j}- \beta_j \ < \ q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow\left[-\hat{\beta}_{j}\right] \\
P\left(-q_{0.975} \times se(\hat \beta_j)-\hat{\beta}_{j} \ < \ -\beta_j \ < \ -\hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow[ \times -1]\\
P\left(q_{0.975} \times se(\hat \beta_j)+\hat{\beta}_{j} \ > \ \beta_j > \ \hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j)\right)=0.95 \ \ \Rightarrow $$

$$P\left(\hat{\beta}_{j}-q_{0.975} \times se(\hat \beta_j) \ < \ \beta_{j} \ < \ \hat{\beta}_{j}+q_{0.975} \times se(\hat \beta_j)\right) = 0.95 $$

------------------------------------------------------------------------

```{r, fig.align='center', fig.height=5.5, fig.width=9}
#| code-fold: true
#| label: fig-convinterval
#| fig-cap: "Confidence Interval"

# Plot of 95% confidence interval
b=2.3     # beta hat
sd=0.96   # se(beta hat)
a=0.05    # significance level
#
beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)
x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits =3 )
cvr <- round( qnorm(1-a/2, 0, 1), digits =3)

plot(x = x, y =dnorm(x, b, sd), type ="l", col ="black", lwd =2, 
     main =bquote( .((1-a)*100) ~"% Confidence Interval; e.g.,"~hat(beta) ~"="~ .(b) ~", "~se(hat(beta)) ~"="~ .(sd)),
     ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)
axis(side=1, at=beg:end, pos =c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x =c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y =c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), col ='cornflowerblue')
# critical region in right tail
polygon(x =c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y =c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), col ='cornflowerblue')
# confidence region
polygon(x =c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y =c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), col ='gray99')

# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col ="gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize =0.08, lwd =6, col ="darkgray")
rug( b, ticksize =0.05, lwd =1.5)

# left
text(beg, maxh*0.3, labels =bquote( .(a/2) ), cex =1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18 , maxh*0.02, length =0.08)
# right
text(end, maxh*0.3, labels =bquote( .(a/2) ), cex =1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length =0.08)
text(end, maxh*0.85, labels =bquote( .(1-a) ), cex =1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length =0.08)
# left
text( beg+(b-beg)/4, maxh*0.45, labels =bquote( hat(beta) ~"-"~.(abs(round(qnorm(a/2),digits=2))) %*%se(hat(beta)) ~" = "~ .(b-abs(cvl)*sd)), cex =1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length =0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels =bquote( hat(beta) ~"+"~ .(abs(round(qnorm(a/2),digits=2))) %*%se(hat(beta)) ~" = "~ .(b+cvr*sd)), cex =1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length =0.08)
# beta hat
text(b-0.1, maxh*0.2, labels =bquote(hat(beta) ~"="~ .(b)), cex =1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length =0.08) 

text(-1.0, 0.35, labels =expression("Estimated distribution of"~hat(beta)),cex =1)
arrows(-1.1, 0.33, 1.11, 0.2, length =0.08)
```

------------------------------------------------------------------------

```{r, fig.align='center', fig.height=5.5, fig.width=9}
#| code-fold: true
#| label: fig-convinterval1
#| fig-cap: "Confidence Interval"

# Plot of 99% confidence interval
b=2.3     # beta hat
sd=0.96   # se(beta hat)
a=0.01    # significance level
#
beg=round(b-4*sd)
end=round(b+4*sd)
maxh=dnorm(b, b, sd)
x <- seq(beg, end, 0.01)
cvl <- round( qnorm(a/2, 0, 1), digits =3)
cvr <- round( qnorm(1-a/2, 0, 1), digits =3)

plot(x = x, y =dnorm(x, b, sd), type ="l", col ="black", lwd =2, 
     main =bquote( .((1-a)*100) ~"% Confidence Interval; e.g.,"~hat(beta) ~"="~ .(b) ~", "~se(hat(beta)) ~"="~ .(sd)),
     ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)
axis(side=1, at=beg:end, pos =c(0,0)) 
axis(side=2) 

# critical region in left tail
polygon(x =c(beg, seq(beg, cvl*sd + b, 0.01), cvl*sd + b),
        y =c( 0, dnorm(seq(beg, cvl*sd + b, 0.01), b, sd), 0 ), col ='cornflowerblue')
# critical region in right tail
polygon(x =c(cvr*sd + b, seq(cvr*sd + b, end, 0.01), end),
        y =c( 0, dnorm(seq(cvr*sd + b, end, 0.01), b, sd), 0 ), col ='cornflowerblue')
# confidence region
polygon(x =c(cvl*sd + b, seq(cvl*sd + b, cvr*sd+b, 0.01), cvr*sd+b),
        y =c( 0, dnorm(seq(cvl*sd + b, cvr*sd+b, 0.01), b, sd), 0 ), col ='gray99')

# Interval
segments( x0=b-abs(cvl)*sd, y0=0.0, x1=b+abs(cvl)*sd, y1=0.0, lwd=5, col ="gray") 
rug( c(b-abs(cvl)*sd, b+abs(cvl)*sd ), ticksize =0.08, lwd =6, col ="darkgray")
rug( b, ticksize =0.05, lwd =1.5)

# left
text(beg, maxh*0.3, labels =bquote( .(a/2) ), cex =1)
arrows(beg, maxh*0.26, b+cvl*sd*1.18 , maxh*0.02, length =0.08)
# right
text(end, maxh*0.3, labels =bquote( .(a/2) ), cex =1)
arrows(end, maxh*0.26, b+cvr*sd*1.18, maxh*0.02, length =0.08)
text(end, maxh*0.85, labels =bquote( .(1-a) ), cex =1)
arrows(end, maxh*0.81, b+(end-b)*0.1, maxh*0.6, length =0.08)
# left
text( beg+(b-beg)/4, maxh*0.45, labels =bquote( hat(beta) ~"-"~.(abs(round(qnorm(a/2),digits=2))) %*%se(hat(beta)) ~" = "~ .(b-abs(cvl)*sd)), cex =1)
arrows(beg+(b-beg)/4, maxh*0.4, cvl*sd+b, maxh*0.01, length =0.08)
# right
text(end-(end-b)/4, maxh*0.45, labels =bquote( hat(beta) ~"+"~ .(abs(round(qnorm(a/2),digits=2))) %*%se(hat(beta)) ~" = "~ .(b+cvr*sd)), cex =1)
arrows(end-(end-b)/4, maxh*0.4, cvr*sd+b, maxh*0.01, length =0.08)
# beta hat
text(b-0.1, maxh*0.2, labels =bquote(hat(beta) ~"="~ .(b)), cex =1.4)
arrows(b-0.1, maxh*0.16, b, maxh*0.025, length =0.08) 

text(-1.0, 0.35, labels =expression("Estimated distribution of"~hat(beta)),cex =1)
arrows(-1.1, 0.33, 1.11, 0.2, length =0.08)
```

------------------------------------------------------------------------

### Interpretation of the confidence interval
- A confidence interval is constructed in such a way that **in repeated samples it will cover the population regression coefficient (the true value) in 95% of the cases**
- Hence, the bounds of a confidence interval are *random*
- We further have an important *relationship* between *confidence intervals* and *hypothesis tests*:

$$a_j \notin interval \ \Rightarrow \ \text{ reject } H_0: \beta_j=a_j \, \text{ in favour of } H_1: \beta_j \neq a_j$$

- This means that if a specific value is not in the confidence interval, we can reject the null hypothesis in a two-tailed test that $\beta_j$ is equal to this value
- Hence, calculating confidence intervals is an **alternative way to carry out parameter tests**
  - Likewise, this equivalence to parameter tests implies that violations of assumptions MLR.1 -- MLR.5 (MLR.6) may lead to *invalid* confidence intervals

------------------------------------------------------------------------

#### Simulating Confidence Intervals

To demonstrate the *interpretation* of confidence intervals from above, we *simulate* the following simple regression model

$$y = 3 + 1 \times x + u$$ 

with a sample size of *n*=100 and calculate the estimate for $\beta_1$ -- with a true value of one -- and the corresponding estimated 95% confidence interval with

$$lower \ bound \ = \ \hat \beta_1 - q_{0.975} \times se(\hat \beta_1) \\
\ upper \ bound \ = \ \hat \beta_1 + \underbrace {q_{0.975}}_{1.984} \times se(\hat \beta_1) $$ 

- This whole procedure is repeated 500 times. We plot the first 100 of the calculated 500 confidence intervals and count how many of them cover the true value of $\beta_1$=1
- **Result**: Regarding the whole 500 simulated intervals, 94.4% of them cover the true value of one. From the first 100 plotted, 95 cover and 5 don't cover the true value.
This is in nearly perfect accordance with the expectations.

```{r, message=FALSE}
#| code-fold: true
library(AER)   # we need coeftest()
set.seed(1)    # set seed
rep <- 500     # number of calculated confidence intervals
n <- 100       # sample size of regressions
nrep <- 100    # first nrep confidence intervals of the simulated rep, which are plotted 

# initialize vectors of lower and upper interval boundaries
lower <- numeric(rep)
upper <- numeric(rep)

# loop of running regressions / estimation / CI
for(i in 1:rep) {
  # model
  x <- rnorm(n, mean=5, sd=5) 
  u <- rnorm(n) 
  y = 3 + x + u
  # regression
  ols <- lm(y ~ x)
  # CIs
  ctest <- coeftest(ols) 
  lower[i] <- ctest[2,1] - qt(1-0.05/2, n) * ctest[2,2] 
  upper[i] <- ctest[2,1] + qt(1-0.05/2, n) * ctest[2,2]
}
```

```{r}
#| code-fold: true
# merge vectors of lower and upper confidence interval bounds into a matrix with two columns
CIs <- cbind(lower, upper)

# percentage of confidence intervals covering the true value of 1
print( paste( mean( CIs[, 1] <= 1 & 1 <= CIs[, 2] )*100, "% of the simulated confidence intervals cover the true value of 1" ) )

# identify intervals not covering the true value of 1 for the first nrep cases
ID <- which( !(CIs[1:nrep, 1] <= 1 & 1 <= CIs[1:nrep, 2]) )
print( paste("The following CIs do not cover the true value:", ID) )
```

```{r, fig.align='center', fig.width=9.5, fig.height=6}
#| code-fold: true
#| label: fig-convIntSim
#| fig-cap: "Simulated confidence intervals"

# initialize the plot for first 100 CIs
plot(0, xlim =c(1, nrep),
     ylim =c( 1-sd(CIs[,1])*4.3, 1+sd(CIs[,2])*4.3), 
     xlab ="Confidence intervals in repeated random samples", 
     ylab =expression(beta[1]~"="~1), 
     main =bquote("Simulating"~ .(nrep) ~"confidence intervals;"~ .(length(ID)) ~"out of"~ .(nrep) ~"do not cover the true value 1") )

# set up color vector
colors <- rep( "grey60", nrep)   # color grey for all 100 CIs 
colors[ID] <- "salmon"           # color salmon for CIs not covering the true value

# draw reference line at beta1=1
abline(h = 1)

# add first nrep=100 confidence intervals with corresponding x,y coordinates 
for(j in 1:nrep) {
  lines(x =c(j,j), y =c(CIs[j, 1], CIs[j, 2]), col = colors[j], lwd =3.6)
}
```

------------------------------------------------------------------------

### Confidence Intervals -- Example

```{r, message=FALSE}
#| comment: " "
library(AER); library(wooldridge); data("wage1")
# we once more estimate our wage model
summary( out <- lm(lwage ~ educ + tenure + exper, data = wage1) )
```

------------------------------------------------------------------------

```{r}
#| comment: " "
# Now we calculate the 95% confidence intervals for all parameters with the R function confint() 
confint(out)
```

As we can see, 0.1 is contained in the 95% CI for `educ`. Therefore, a 10% effect of one more year of education cannot be ruled out with a significance level of 5%. This is the same result as we got earlier with the corresponding t-test.
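As a quick cross-check (an addition to the example), the interval for `educ` reported by `confint()` can be reproduced by hand with @eq-confinterval:

```{r}
#| comment: " "
# own illustration: 95% CI for educ by hand, estimate ± 0.975 t-quantile times standard error
est <- coef(summary(out))[2, 1]   # estimated coefficient of educ
se  <- coef(summary(out))[2, 2]   # its standard error
dgf <- summary(out)$df[2]         # residual degrees of freedom, n - k - 1
c( est - qt(0.975, dgf) * se, est + qt(0.975, dgf) * se )
```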
```{r}
#| comment: " "
# Calculating the 99% confidence intervals for all parameters
confint(out, level = 0.99)
```

------------------------------------------------------------------------

## Testing multiple linear restrictions

To motivate this section, it is best to start with an example

- Suppose we want to investigate whether the salaries of baseball players depend on their achievements

$$\begin{align*}log(salary) = \beta_0 &+ \beta_1\cdot years + \beta_2 \cdot gamesyr \\&+ \beta_3 \cdot bavg + \beta_4\cdot hrunsyr + \beta_5\cdot rbisyr + u\end{align*}$$

- `years` = number of years as an active player
- `gamesyr` = number of games played per year
- `bavg` = batting average
- `hrunsyr` = homeruns per year
- `rbisyr` = runs batted in per year
- Data set *mlb1* from the *Wooldridge* package, n = 353 major league baseball players
- The *last three variables are measures of performance*. We want to test whether the salaries of the players depend on these, *given* the years as an active player and the number of games played per year
- Our null hypothesis is $H_0: \beta_3=0,\ \beta_4=0, \ \beta_5=0$ against the $H_1:$ at least one parameter ≠ 0
- As we have to deal with **multiple** *restrictions* in this case, our usual t-test is not applicable any more

------------------------------------------------------------------------

- One possible *general* approach is as follows -- **Wald or F Tests**
  - We first estimate the model without imposing the restrictions; this is the **unrestricted** version
  - After that, we estimate the model again, this time enforcing the restrictions; this is the **restricted** version (actually, this step is not really required for a Wald test, but it is easier to explain the test idea this way)
  - In the case of *zero restrictions*, like $\beta_j=0, \, \ldots$, (called **exclusion restrictions**), this is accomplished very easily by simply dropping the corresponding variables from the model
  - In the case of *more complicated restrictions* involving *several* parameters, like $\beta_i - \beta_j=0$ or $\beta_i+\beta_j=1$, a *reparameterization* of the model is necessary: For the first case, we could define a new parameter $\theta=\beta_i - \beta_j$, plug in $\beta_i=\theta+\beta_j$ into the model and collect the variables accordingly. Afterwards, we drop the variable corresponding to $\theta$
- Clearly, imposing restrictions on the model leads to a poorer fit. The **test idea** is that the $H_0$ *has to be rejected if the fit of the model worsens too much*.
In this case, the data, along with the model of course, are apparently not compatible with the null hypothesis
- Hence, we are looking for a *test statistic which appropriately measures the decrease in the model's fit caused by the imposition of the restrictions* and *from which we know the distribution*

------------------------------------------------------------------------

#### The F-statistic

- The following test statistic does the job

$$F \ = \ \frac{\left(SSR_r-SSR_{ur} \right) / m}{SSR_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1) $$ {#eq-Fstat}

Thereby, $SSR \equiv \sum\nolimits _{i=1}^n \hat u_i^2$ is defined for both the restricted and the unrestricted model

- This statistic *measures the percentage decrease in the fit* (the percentage ***in**crease* in *SSR*), corrected for the *degrees of freedom* of the *numerator* -- *m* is the number of restrictions -- and of the *denominator* -- $(n-k-1)$ are the degrees of freedom in calculating $\hat \sigma_{ur}^2$
- There is an equivalent version of this test statistic in terms of the $R^2$ instead of *SSR*. But this version is only applicable if the dependent variable is the same for both the restricted and the unrestricted model (a small numerical sketch follows below)

$$F \ = \ \frac{(R_{ur}^2 - R_{r}^2) / m}{(1-R_{ur}^2) / (n-k-1)} \ \sim \ F(m, n-k-1) $$ {#eq-Fstat1}

- This measures the decrease in the model's fit in terms of a decrease in the $R^2$
- Note that for both versions the numerator is *always positive*, as a (binding) restriction always reduces the model's fit
- One can show that the test statistic introduced above is *F*-*distributed* with $m$ and $(n-k-1)$ degrees of freedom, as **both the numerator and the denominator are** (scaled) $\chi^2$ **distributed random variables**, which is the definition of an F-distribution. This test statistic is simply called the ***F-statistic***
- **Important Remark**: The t-statistic introduced above is a special case of the F-statistic if we only test *one* restriction, i.e., $m=1$. Actually, in this case, the F-statistic is simply the *square* of the t-statistic. Therefore, any (two-tailed) t-test can also be formulated as an F-test
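The following sketch is an addition to the text: it applies the $R^2$ version @eq-Fstat1 to the wage equation estimated above, testing the joint exclusion of `exper` and `tenure` (so the restricted model only keeps `educ`). The SSR version @eq-Fstat would give exactly the same value here, since the dependent variable is identical in both models.

```{r}
#| comment: " "
# own illustration: F-test with the R^2 version of the test statistic
unr_w <- lm(lwage ~ educ + tenure + exper, data = wage1)   # unrestricted model
res_w <- lm(lwage ~ educ, data = wage1)                    # restricted model
m     <- 2                         # number of restrictions
dfree <- summary(unr_w)$df[2]      # n - k - 1 of the unrestricted model
R2ur  <- summary(unr_w)$r.squared
R2r   <- summary(res_w)$r.squared
Fstat <- ( (R2ur - R2r)/m ) / ( (1 - R2ur)/dfree )
Fstat
# 95% critical value and p-value
qf(0.95, m, dfree)
1 - pf(Fstat, m, dfree)
```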
------------------------------------------------------------------------

#### Wald statistic

- As the denominator of the F-statistic is simply the estimate of $\sigma_{ur}^2$, which can be *consistently* estimated by OLS, there is an *asymptotic version* of the F-statistic, which we call *W*. Thereby, $\hat \sigma_{ur}^2$ in *F* is replaced by $\sigma_{ur}^2$ (assuming the number of observations is large enough so that the estimate of $\sigma_{ur}^2$ has already converged to the true value) and *F* is further multiplied by *m*

$$W \ = \ \frac{\left(SSR_r-SSR_{ur} \right)}{\sigma_{ur}^2} \ \sim \ \chi^2(m) $$ {#eq-Wstat}

- *W*, which is called the **Wald statistic**, is $\chi^2$ distributed, as the denominator is no longer a random variable but simply a scaling constant required so that *W* is actually $\chi^2$ distributed
- Many econometric software packages let the user choose whether the *F* or the $\chi^2$ version of the test should be used

------------------------------------------------------------------------

### Carrying out the test

- Carrying out an F-test is basically the same procedure as with a t-test
  - We postulate a null hypothesis
  - We determine the critical value as the 0.95 quantile of the corresponding *F* or $\chi^2$ distribution (for a significance level of 5%)
  - We calculate the test statistic and check whether it falls into the rejection region, i.e., whether the test statistic is larger than the critical value
  - Alternatively, we can calculate the p-value corresponding to the test statistic
- The graphics below illustrate this procedure
  - Note that the *F* and $\chi^2$ distributions are very similar in this case (apart from their scaling) because the denominator's degrees of freedom are quite high (347). The numerator's degrees of freedom are 3 -- we test three restrictions
  - Both test statistics correspond to the *baseball player example* presented above. Further note the different scaling of the two distributions, which amounts to the factor *m* = 3
  - As shown after the figures, the F-statistic and W-statistic for the baseball player example are 9.55 and 28.65, respectively. Is the $H_0$ rejected?

------------------------------------------------------------------------

```{r, fig.align='center', fig.height=6, fig.width=9, fig.cap="4.9"}
#| code-fold: true
#| label: fig-Fdistribution
#| fig-cap: "F-distribution"

# Plot the F(3, 347) Distribution on the interval [0, 5]
df1 <- 3
df2 <- 353-6
end <- 5
a <- 0.05
hmax <- df( qf(0.3, df1, df2), df1, df2) # not really hmax
t <- seq(0, 5, 0.01)
cv <- round( qf(1-a, df1, df2), digits =2 )

plot(x = t, y =df(t, df1, df2), type ="l", col ="black", lwd =2, 
     main =bquote("Test with F("~ .(df1) ~", "~ .(df2) ~") distribution and"~ alpha ~"="~ .(a)),
     ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)
axis(side=1, at=0:end, pos =c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x =c(cv, seq(cv, 5, 0.01), 5),
        y =c( 0, df(seq(cv, 5, 0.01), df1, df2), 0 ), col ='cornflowerblue')

text(end*0.75, hmax/2, labels =bquote(alpha ~"="~ .(a)),cex =1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length =0.1)
text(end*0.15, hmax/2.8, labels =bquote( qf[0.95] ~" = "~ .(cv)), cex =1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length =0.08) 
```

------------------------------------------------------------------------

```{r, fig.align='center', fig.height=6, fig.width=9, fig.cap="4.10"}
#| code-fold: true
#| label: fig-Chidistribution
#| fig-cap: "Chi-distribution"

# Plot the ChiSq(3) Distribution on the interval [0, 15]
df <- 3
a <- 0.05
end <- df*5
hmax <- dchisq( qchisq(0.3, df), df) # not really hmax
t <- seq(0, end, 0.01)
cv <- round( qchisq(1-a, df), digits =2 )

plot(x = t, y =dchisq(t, df), type ="l", col ="black", lwd =2, 
     main =bquote("Test with ChiSq("~ .(df) ~") distribution and"~ alpha ~"="~ .(a)),
     ylab="", xlab="", cex.lab =1, cex.main =1.5, axes=FALSE)
axis(side=1, at=0:end, pos =c(0,0) )
axis(side=2) 

# critical region in right tail
polygon(x =c(cv, seq(cv, end, 0.01), end),
        y =c( 0, dchisq(seq(cv, end, 0.01), df), 0 ), col ='cornflowerblue')

text(end*0.75, hmax/2, labels =bquote(alpha ~"="~ .(a)),cex =1)
arrows(end*0.75, hmax/2.2, end*0.65, hmax*0.05, length =0.1)
text(end*0.15, hmax/2.8, labels =bquote( qchisq[0.95] ~" = "~ .(cv)), cex =1)
arrows(end*0.15, hmax/3.2, cv*0.99, hmax*0.01, length =0.08) 
```

------------------------------------------------------------------------

### Baseball example continued

##### **We estimate the baseball player model; unrestricted version**

```{r, message=FALSE}
#| comment: " "
library(AER); library(wooldridge); data("mlb1")
unr <- lm( log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(unr)
```

- Note that **none** of the *performance variables* is statistically significant

------------------------------------------------------------------------

##### **We estimate the baseball player model, this time in its restricted version**

```{r}
#| comment: " "
r <- lm( log(salary) ~ years + gamesyr, data = mlb1)
summary(r)
```

------------------------------------------------------------------------

##### **We calculate the F-statistic from @eq-Fstat**

```{r}
#| comment: " "
SSRur <- sum( unr$residuals^2 )
SSRr  <- sum( r$residuals^2 )
df <- summary(unr)$df[2]
Fstat <- (SSRr - SSRur)/SSRur * df/3
Fstat
# 95% critical value
qf(0.95, 3, df)
# p-value
1 - pf(Fstat, 3, df)
```

- The $H_0$ that *none* of the performance variables has any impact on the salary of the players is strongly rejected, although the variables, each on their own, are not significant -- a multicollinearity problem
- Compare @fig-Fdistribution

------------------------------------------------------------------------

##### **We carry out the same test with `lht()` (for linearHypothesis), contained in the `AER` package**

- Applying this procedure, we notice that it is actually not necessary to estimate the restricted model to carry out the test.
We only need the unrestricted model, as was the case with the simple t-test

```{r}
#| comment: " "
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"))
```

- We get the very same results as with the F-test carried out by hand

------------------------------------------------------------------------

##### **We calculate the W-statistic from @eq-Wstat; the chi squared version**

```{r}
#| comment: " "
sig2ur <- SSRur/df 
Wstat <- (SSRr - SSRur)/sig2ur
Wstat
# 95% critical value
qchisq(0.95, 3)
# p-value
1 - pchisq(Wstat, 3)
```

------------------------------------------------------------------------

##### **The same test with `lht()`, contained in the `AER` package; chi squared version**

```{r}
#| comment: " "
lht(unr, c("bavg=0", "hrunsyr=0", "rbisyr=0"), test ="Chisq")
```

- We get the same result as when testing by hand and nearly the same result as with the F-statistic version
- Compare @fig-Chidistribution

------------------------------------------------------------------------

### "the" F-statistic of a regression model {#sec-theFstat}

A special case of the F-test is a test of the **overall significance** of a regression, with the null hypothesis

$$H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$$

This null hypothesis states that the explanatory variables *overall* are *not useful* in explaining the dependent variable

- This test of *overall significance* is routinely calculated and reported in most regression packages as "the" F-statistic; the null hypothesis is usually overwhelmingly rejected
- If this $H_0$ cannot be rejected, the model is of *no value* in explaining the dependent variable

------------------------------------------------------------------------

### Additional examples

As a last **example** in this chapter we will re-estimate our wage equation

$$log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u$$

- We will test whether the effect of one additional year of education is three times as large as the effect of one additional year of tenure; $H_0: \beta_1 = 3 \beta_3$
  - As this hypothesis involves only one restriction, a t-test could be used. However, the restriction affects two parameters; therefore, a reparameterization of the model with a variable transformation would be necessary to implement it as a t-test (a sketch of this route is shown below)
  - Using the R function `lht()` (*for linearHypothesis*) of the `car` and `AER` packages implements this test very easily
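For illustration (an addition to the text), the reparameterization route could look as follows: define $\theta = \beta_1 - 3 \beta_3$ and substitute $\beta_1 = \theta + 3 \beta_3$ into the model, so that the coefficient of `educ` in the transformed regression is $\theta$ and its t-test checks $H_0: \beta_1 = 3 \beta_3$. The squared t-statistic of `educ` here equals the F-statistic that `lht()` reports below.

```{r}
#| comment: " "
# own illustration: auxiliary regression for the reparameterized model
# beta1*educ + beta3*tenure = theta*educ + beta3*(3*educ + tenure), with theta = beta1 - 3*beta3
repar <- lm( lwage ~ educ + I(3*educ + tenure) + exper, data = wage1 )
coef(summary(repar))["educ", ]
```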
------------------------------------------------------------------------

```{r, message=FALSE}
#| comment: " "
library(AER); library(wooldridge); data("wage1")
# we first estimate the model
summary( out <- lm(lwage ~ educ + tenure + exper, data = wage1) )
```

- Note "the" F-statistic!

------------------------------------------------------------------------

- We test the H0: beta1 = 3 beta3

```{r}
#| comment: " "
lht( out, "educ = 3*tenure") 
```

------------------------------------------------------------------------

- We additionally test whether beta1 = 0.1
- If we have several restrictions, the restrictions must be put into a character vector with the `c()` function

```{r}
#| comment: " "
lht( out, c("educ = 3*tenure", "educ=0.1") )
```

------------------------------------------------------------------------

- These two restrictions imposed together are clearly not compatible with the data

------------------------------------------------------------------------

## Tests in matrix notation

::: {.callout-caution collapse="false" icon="false"}
## Wald test statistic with general linear hypothesis in matrix notation

Every set of linear hypotheses can be stated as:

$$\mathbf R \boldsymbol{\hat{\beta}} = \mathbf c$$ {#eq-restrict}

Thereby, the restriction matrix $\mathbf R$ has dimensions *m* x (*k*+1), with *m* being the number of restrictions. Consider the following example:

$$\underbrace{\left[\begin{array}{cccc}0 & 0 & 1 & 0 \\0 & 0 & 0 & 1\end{array}\right]}_{R}\left[\begin{array}{c}\hat{\beta}_{0} \\\hat{\beta}_{1} \\\hat{\beta}_{2} \\\hat{\beta}_{3}\end{array}\right] = \underbrace{\left[\begin{array}{c}0 \\0\end{array}\right]}_{c}$$

Multiplying out, we see that, in this case, we test whether $\hat \beta_2=0$ *and* $\hat \beta_3=0$.

*We have the following important Theorem*:

::: {#thm-distrib_quadratic_form}
## Distribution of a quadratic form

Let $\mathbf z$ be a $N(\mathbf 0, \mathbf V)$ normally distributed random vector with *h* elements.
Then the quadratic form

$$\mathbf z' \mathbf V^{-1} \mathbf z$$

is distributed as

$$\mathbf z' \mathbf V^{-1} \mathbf z \ \sim \ \chi^2(h)$$ 

(see Appendix A.6 and A.9).
:::

------------------------------------------------------------------------

- Assuming that $\sigma^2$ is a *known* constant and applying the covariance formula from @eq-C_covmatrix_lin_combination to the restriction vector $(R \boldsymbol{\hat{\beta}} - \mathbf c)$, we get the covariance matrix of this random vector:

$$\sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R'$$ {#eq-eq-restrictCovMat}

- As $\boldsymbol{\hat \beta}$ is asymptotically normally distributed (or normally distributed in small samples if $\mathbf u$ is normally distributed), provided $H_0$ is true, the restriction vector is

$$(R \boldsymbol{\hat{\beta}} - \mathbf c) \ \sim \ N(\mathbf 0, \sigma^{2} \mathbf{R} \left( \mathbf{X}' \mathbf{X} \right)^{-1} \mathbf R')$$

- Applying @thm-distrib_quadratic_form from above, we conclude that the following must hold:

$$W \ = \ \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})^{\prime}}_{z^{\prime}} \underbrace{\left(\sigma^{2} \mathbf{R}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{R}^{\prime}\right)^{-1}}_{V^{-1}} \underbrace{(\mathbf{R} \hat{\boldsymbol{\beta}}-\mathbf{c})}_{z} \ \sim \ \chi^{2}(m) $$ {#eq-WaldStatMat}

This Wald test statistic is a quadratic form that weights the distance between the hypothesized values $\mathbf c$ and the actual estimation outcomes $\mathbf R \boldsymbol{\hat \beta}$ by the inverse of its covariance matrix. If this normed distance is too large, $H_0$ is rejected.

For *m* = 1 (only one restriction), $\sqrt W$ reduces to a simple *t*-statistic.

------------------------------------------------------------------------

If $\sigma^2$ is *unknown*, it has to be estimated by $\hat \sigma^2={\sum_{i=1}^n \hat u_i^2} \ / \ {(n-k-1)}$.

- With the help of the above theorem, it can be shown that this estimator, after multiplying by $(n-k-1)/\sigma^2$, is distributed as $\chi^2(n-k-1)$
- Hence, in this case, the Wald statistic @eq-WaldStatMat (with $\hat \sigma^2$ in place of $\sigma^2$) is the ratio of two independently distributed $\chi^2$ random variables and therefore is *F*-distributed after appropriate scaling (see Appendix A.7):

$$F \ = \ \frac{(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)' \left(\hat\sigma^{2} \mathbf R\left(\mathbf X' \mathbf X\right)^{-1} \mathbf R'\right)^{-1}(\mathbf R \boldsymbol{\hat{\beta}} - \mathbf c)}{m} \ \sim \ F(m,n-k-1) $$ {#eq-FStatMat}

Note that in *small* samples this formula is valid only if $\mathbf u$ is normally distributed. In *large* samples, @eq-FStatMat times *m* converges to @eq-WaldStatMat, and so the two test statistics are asymptotically equivalent.

- Furthermore, it can be shown that the formula above is equivalent to:

$$F \ = \ \frac{\left(\hat{\mathbf{u}}_{r}^{\prime} \hat{\mathbf{u}}_{r}-\hat{\mathbf{u}}_{ur}' \hat{\mathbf{u}}_{ur} \right) / m}{\hat{\mathbf{u}}'_{ur} \hat{\mathbf{u}}_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1)$$ {#eq-FStatMatAlternative}

The interpretation of this test statistic is straightforward: If the fit of the restricted model (measured by *SSR*) deteriorates too much relative to the model without the restrictions, the restrictions, i.e. the $H_0$, are rejected.
:::
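To connect these formulas with the earlier computations, the following sketch (an addition to the text) evaluates @eq-FStatMat directly for the baseball example, using the unrestricted model `unr` from above. The result reproduces the F-statistic reported by `lht()`, and `3 * Fstat` gives the W-statistic of 28.65 reported earlier.

```{r}
#| comment: " "
# own illustration: F-test of the three exclusion restrictions in matrix notation
bhat <- coef(unr)                      # estimated coefficients of the unrestricted model
V    <- vcov(unr)                      # sigma^2_hat * (X'X)^(-1)
R    <- matrix(0, nrow = 3, ncol = length(bhat))
R[1, 4] <- R[2, 5] <- R[3, 6] <- 1     # select bavg, hrunsyr and rbisyr
cc   <- c(0, 0, 0)                     # hypothesized values
m    <- 3
Fstat <- t(R %*% bhat - cc) %*% solve(R %*% V %*% t(R)) %*% (R %*% bhat - cc) / m
Fstat                                  # about 9.55, as with lht() above
```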