rm(list = ls()) # delete all
library(AER)
library(wooldridge)
library(marginaleffects)
library(ggplot2)
library(effects)
library(modelsummary)
data("attend")
5.1 Topics of this chapter
- Interaction terms
- Dummy variables
  - Binary dummies
  - Interaction with dummy variables
  - Treatment effects and the Difference in Difference estimator
  - Chow test
  - Dummies for multiple characteristics – categorical data
- Binary variables as dependent variable – the linear probability model (LPM)
- Prediction errors
  - Variance and confidence intervals for predictions
  - Studentized residuals
5.2 Interaction terms
In Section 2.9.4, we discussed a model where unobservable factors affect the slope parameters \boldsymbol \beta – the random coefficient model, which poses no real additional problems for estimation (besides heteroscedasticity).
In this section we discuss the influence of the level of an observable variable, x_2, on the partial effect (the slope) of x_1 on y. This is modeled with interaction terms, which are products of two (or more) regressors:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u \tag{5.1}
- So, the partial effect of x_1 on y is
\dfrac {\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2 \tag{5.2}
- Clearly, this partial effect depends on the level of x_2. Note that \beta_1 is the partial effect of x_1 only if x_2 = 0, which may not be the case for any observation in the sample
A reparametrization
- A more reasonable parametrization is:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 - \bar {x}_1) (x_2 - \bar {x}_2) + u \tag{5.3}
Now, \beta_1 is the partial effect of x_1 on y if (x_2 - \bar {x}_2) = 0, which means that x_2 equals its own mean. The same is true for the partial effect of x_2, which equals \beta_2 if x_1 = \bar {x}_1
This parametrization has several advantages:
- Easy interpretation of all parameters
- Standard errors for the partial effects at the mean values are readily available
- If necessary, the interaction may be centered at other interesting values
Average partial effects
In models with quadratic terms, interactions, and other nonlinear functional forms, the partial effect depends on the values of one or more explanatory variables
- The Average Partial Effect (APE) is a summary measure to describe the relationship between the dependent variable and each explanatory variable
- After computing the partial effect for every unit i by plugging the estimated parameters and the values of the variable x_{i,2} into Equation 5.2, average these individual partial effects across the units of the sample. For our previous example we have: \widehat {APE} \, = \, \frac {1}{n}\sum_{i=1}^n (\hat \beta_1 + \hat \beta_3 x_{i,2}) \, = \, \hat \beta_1 + \hat \beta_3 \frac {1}{n}\sum_{i=1}^n x_{i,2} \, = \, \hat \beta_1 + \hat \beta_3 \bar x_{2} \tag{5.4}
- Below, an example shows that this is the same as \beta_1 in Equation 5.3 1
In R, the APE is easily computed with the `avg_slopes()` function from the `marginaleffects` package
5.2.1 Interaction terms - Example
In the following example from Wooldridge (2019) we want to estimate the effect of attending a microeconomics lecture, measured by the attendance rate `atndrte` (percentage of lectures attended), on the performance in the final exam, measured by `stndfnl`, a standardized test score (mean = 0, sd = 1)
- We estimate the following model:
stndfnl = \beta_0 + \beta_1 \, atndrte + \beta_2 \, priGPA + \beta_3 ACT + \beta_4 \, priGPA^2+ \beta_5 ACT^2 \\ + \beta_6 \, atndrte * priGPA + u
We have two control variables: `priGPA` (prior GPA result) and `ACT` (another test result). These should control for the general ability of the students
- We further want to know whether good students benefit more or less from attending the lecture. This is measured by the interaction variable `atndrte` \, * \, `priGPA`: If this variable has a positive coefficient, then more capable students (with a higher `priGPA`) exhibit a higher partial effect of the attendance rate and thus benefit more from attending the lecture \frac {\partial stndfnl}{\partial atndrte} = \beta_1 + \beta_6 \, priGPA
- The coefficient of `atndrte`, \beta_1, taken on its own, is the partial effect of `atndrte` at a value of 0 for `priGPA` (which nobody in the sample has)
Furthermore, we allow for additional nonlinear (quadratic) effects of the control variables
In an R formula, the formulation (`atndrte * priGPA`) automatically generates the variables `atndrte`, `priGPA` and the interaction variable `atndrte:priGPA`. The function `I()` is necessary whenever the expression inside `I()` should be taken literally (here, for the squared terms)
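The estimation call itself is not shown above. A minimal sketch of how it could be done (the object names `myres` and `b` are those used in the code below; the formula matches the unrestricted model reported by `lht()` further down):
```{r}
# estimate the model with the interaction term and the quadratic controls
myres <- lm(stndfnl ~ atndrte * priGPA + ACT + I(priGPA^2) + I(ACT^2),
            data = attend)
b <- coef(myres)   # store the coefficients for the "by hand" calculations below
```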
The results are:
Code
# you can simply use summary(myres) to show the output
# summary(myres)
# or more pretty:
modelsummary(list("Model 1"=myres),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "A|L|B|F",
coef_omit = "Interc",
align = "ldddddd",
fmt = 4,
output = "gt")
Model 1 | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
atndrte | -0.0067 | 0.0102 | -0.6561 | 0.5120 | -0.0268 | 0.0134 |
priGPA | -1.6285*** | 0.4810 | -3.3857 | 0.0008 | -2.5730 | -0.6841 |
ACT | -0.1280 | 0.0985 | -1.3000 | 0.1940 | -0.3214 | 0.0653 |
I(priGPA^2) | 0.2959** | 0.1010 | 2.9283 | 0.0035 | 0.0975 | 0.4943 |
I(ACT^2) | 0.0045* | 0.0022 | 2.0829 | 0.0376 | 0.0003 | 0.0088 |
atndrte × priGPA | 0.0056 | 0.0043 | 1.2938 | 0.1962 | -0.0029 | 0.0141 |
Num.Obs. | 680 | |||||
R2 | 0.229 | |||||
RMSE | 0.87 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
\beta_1 = -0.0067 is the partial effect of `atndrte` at `priGPA` = 0, which is quite irrelevant and also has the wrong sign (attending the lecture should be beneficial)
APE
Now, we want to calculate the APE of `atndrte` at the mean of `priGPA`, which is 2.587 (Equation 5.4). Remember, we stored the regression coefficients in the vector `b` (see the sketch above)
# estimate partial effect at the mean of priGPA = 2.586775
b["atndrte"] + mean(attend$priGPA) * b["atndrte:priGPA"]
atndrte
0.007736558
This partial effect is positive (as expected) and also quite large. It means that if the attendance rate changes from 0% to 100% for an average student, the test score improves by 77% of the standard deviation of the test score (remember, `stndfnl` is scaled to mean zero and standard deviation one)
Is this effect statistically significant? The following test says yes:
# test partial effect
lht(myres, "atndrte + 2.586775 * atndrte:priGPA = 0" )
Linear hypothesis test
Hypothesis:
atndrte + 2.586775 atndrte:priGPA = 0
Model 1: restricted model
Model 2: stndfnl ~ atndrte * priGPA + ACT + I(priGPA^2) + I(ACT^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 674 519.34
2 673 512.76 1 6.5786 8.6344 0.003412 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As always with an F-test with only one restriction, the t-statistic is \sqrt F = \sqrt {8.6344} = 2.9384
Using the function `avg_slopes()` from the `marginaleffects` package, the APE (or AME) should be the same as the value calculated above, and this is indeed the case.
# checking results with "avg_slopes()":
m <- avg_slopes(myres)
# You can simply use m to show the output
m
Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
ACT mean(dY/dX) 0.07606 0.01120 6.79 <0.001 36.4 0.05411 0.0980
atndrte mean(dY/dX) 0.00774 0.00263 2.94 0.0033 8.2 0.00258 0.0129
priGPA mean(dY/dX) 0.35876 0.07779 4.61 <0.001 17.9 0.20630 0.5112
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type: response
# A prettier output can be produced with `modelsummary`:
Code
modelsummary(list("APE for model 1"=m),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "",
align = "ldddddd",
fmt = 4,
output = "gt")
APE for model 1 | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
ACT | 0.0761*** | 0.0112 | 6.7913 | <1e-04 | 0.0541 | 0.0980 |
atndrte | 0.0077** | 0.0026 | 2.9385 | 0.0033 | 0.0026 | 0.0129 |
priGPA | 0.3588*** | 0.0778 | 4.6121 | <1e-04 | 0.2063 | 0.5112 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
`avg_slopes()` calculates the APEs (or AMEs) for all variables and additionally carries out significance tests and computes confidence intervals.
The APE for `atndrte` is the same as the one calculated “by hand” above.
- The p-values are slightly different, because `avg_slopes()` uses the normal distribution instead of the t-distribution (asymptotic test, which is why the label z-value is used instead of t-value)
With the `plot_predictions()` function of the `marginaleffects` package, the effects of the attendance rate, subject to different values of `priGPA`, can be displayed: more capable students seem to benefit more.
plot_predictions(myres, condition = list("atndrte", "priGPA"=seq(1,3.5,0.5))) +
facet_wrap(~ priGPA) +
theme_bw()
Effect of `atndrte` on `stndfnl`, by `priGPA`
- However, the effect is not significant – look at the estimate and t-value of the interaction term in the regression Table 5.1
Reparametrization
We estimate the same model once more, but this time with a reparametrization: The interaction term now is atndrte * (priGPA - \overline{priGPA})
With this parametrization, the partial effect of `atndrte`, \beta_1, is measured at the mean of `priGPA` and therefore corresponds to the APE, see Equation 5.3
- In R, de-meaning a variable is done with the function `scale(variable, scale = FALSE)`. Without the option `scale = FALSE`, the `scale()` function not only subtracts the mean but also divides by the standard deviation
# estimation of the model, this time with de-meaned priGPA in the interaction term so,
# the interpretation of the coef of atndrte is the partial effect of atndrte
# at the mean of priGPA
myres1 <- lm(stndfnl ~ atndrte + priGPA + ACT + I(priGPA^2) + I(ACT^2) +
atndrte:scale(priGPA, scale = FALSE), data = attend)
As the following regression output shows, the estimated APE for `atndrte` and its t-value exactly correspond to the values we got with the `avg_slopes()` function in Table 5.2.
Code
# summary(myres1)
modelsummary(list("Model 2"=myres1),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "A|L|B|F",
coef_omit = "Interc",
align = "ldddddd",
fmt = 4,
output = "gt")
Model 2 | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
atndrte | 0.0077** | 0.0026 | 2.9384 | 0.0034 | 0.0026 | 0.0129 |
priGPA | -1.6285*** | 0.4810 | -3.3857 | 0.0008 | -2.5730 | -0.6841 |
ACT | -0.1280 | 0.0985 | -1.3000 | 0.1940 | -0.3214 | 0.0653 |
I(priGPA^2) | 0.2959** | 0.1010 | 2.9283 | 0.0035 | 0.0975 | 0.4943 |
I(ACT^2) | 0.0045* | 0.0022 | 2.0829 | 0.0376 | 0.0003 | 0.0088 |
atndrte × scale(priGPA, scale = FALSE) | 0.0056 | 0.0043 | 1.2938 | 0.1962 | -0.0029 | 0.0141 |
Num.Obs. | 680 | |||||
R2 | 0.229 | |||||
RMSE | 0.87 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Code
modelsummary( list(myres, myres1),
output="gt",
statistic = "statistic",
coef_omit = "Inter",
gof_omit = "A|B|L|F",
align = "ldd",
stars = TRUE,
fmt = 4)
(1) | (2) | |
---|---|---|
atndrte | -0.0067 | 0.0077** |
(-0.6561) | (2.9384) | |
priGPA | -1.6285*** | -1.6285*** |
(-3.3857) | (-3.3857) | |
ACT | -0.1280 | -0.1280 |
(-1.3000) | (-1.3000) | |
I(priGPA^2) | 0.2959** | 0.2959** |
(2.9283) | (2.9283) | |
I(ACT^2) | 0.0045* | 0.0045* |
(2.0829) | (2.0829) | |
atndrte × priGPA | 0.0056 | |
(1.2938) | ||
atndrte × scale(priGPA, scale = FALSE) | 0.0056 | |
(1.2938) | ||
Num.Obs. | 680 | 680 |
R2 | 0.229 | 0.229 |
RMSE | 0.87 | 0.87 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Hence, as argued above, this reparametrization directly delivers the APE
We check with `avg_slopes()` from the `marginaleffects` package and once more get the same results
# checking with "avg_slopes": calculate APE
m1 <- avg_slopes(myres1)
m1
Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
ACT mean(dY/dX) 0.07606 0.01120 6.79 <0.001 36.4 0.05411 0.0980
atndrte mean(dY/dX) 0.00774 0.00263 2.94 0.0033 8.2 0.00258 0.0129
priGPA mean(dY/dX) 0.35876 0.07779 4.61 <0.001 17.9 0.20630 0.5112
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type: response
Diagrams of non-linear effects
Finally, we want to plot the non-linear effects of `priGPA` and `ACT` at the means of all other variables.
- Both have quadratic terms in the model, and `priGPA` additionally has an interaction with `atndrte`. So, calculating the effects by hand would be quite tedious. Fortunately, the function `plot_predictions()` of the `marginaleffects` package does the work for us
- With `plot_slopes()`, you get the marginal effect (slope) of the variable, as in Equation 5.2
- With `plot_predictions()`, you get the effect of the variable on the expected value of the dependent variable
plot_predictions(myres1, condition="priGPA") + theme_bw()
plot_slopes(myres1, variables="priGPA", condition=list("priGPA", "atndrte"=c(0, 50, 100)) ) + theme_bw()
Effect of `priGPA` on test score; marginal effect of `priGPA` on test score, dependent on `atndrte`. Both a higher `priGPA` and a higher `atndrte` increase the marginal effect of `priGPA` on the test score
plot_predictions(myres1, condition="ACT") + theme_bw()
plot_slopes( myres1, variables="ACT", condition="ACT" ) + theme_bw()
Effect of `ACT` on test score; marginal effect of `ACT` on test score
5.3 Dummy variables
Dummy variables contain qualitative information which is typically coded as 1 or 0.
- Examples of qualitative information are:
  - Cross-section data: gender, race, family background, industry, region, rating grade, people who got a special treatment or not, …
  - Time series data: seasonality, special events like strikes or natural catastrophes, time before and after a certain date / policy intervention, …
Dummy variables may appear as dependent or as independent variables
- Example of a dependent dummy variable: InLaborForce = \beta_0 + \beta_1 FamilyIncome + \beta_2 NumOfChildren + \ \cdots \ + u Here, `InLaborForce` is a dummy variable with a value of 1 if the person (woman) is in the labor force (working), and a value of 0 if the person is not. Such cases are not treated in this section, but in a later one about the linear probability model
- Example of an independent dummy variable (which is treated in this section): wage = \beta_0 + \beta_1 female + \ \cdots \ + u Here, `female` is a dummy variable with a value of 1 if the person is a woman, and a value of 0 if the person is a man
- The parameter \beta_1 represents the gain/loss in wages if the person is a woman rather than a man (holding other things, which are controlled for, fixed)
Example data
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | south | west | construc | ndurman | trcommpu | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3.10 | 11 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.131402 | 4 | 0 |
3.24 | 12 | 22 | 2 | 0 | 1 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1.175573 | 484 | 4 |
3.00 | 11 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.098612 | 4 | 0 |
6.00 | 8 | 44 | 28 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1.791759 | 1936 | 784 |
5.30 | 12 | 7 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.667707 | 49 | 4 |
8.75 | 16 | 9 | 8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.169054 | 81 | 64 |
11.25 | 18 | 15 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2.420368 | 225 | 49 |
5.00 | 12 | 5 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1.609438 | 25 | 9 |
3.60 | 12 | 26 | 4 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1.280934 | 676 | 16 |
18.18 | 17 | 22 | 21 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2.900322 | 484 | 441 |
Example - Wage equation
We estimate a simple model explaining log(wage) by gender only
out <- lm(lwage ~ female, data = wage1)
Code
modelsummary(list("Wage dependent on gender "=out),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "AI|L|F|B",
align = "ldddddd",
output = "gt")
Wage dependent on gender | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 1.814*** | 0.030 | 60.830 | <0.001 | 1.755 | 1.872 |
female | -0.397*** | 0.043 | -9.222 | <0.001 | -0.482 | -0.313 |
Num.Obs. | 526 | |||||
R2 | 0.140 | |||||
R2 Adj. | 0.138 | |||||
RMSE | 0.49 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Interpretation of results of this example
The intercept, 1.814, is the average log(wage) of men (as there are no other variables in the model)
- The coefficient of `female` shows the difference in the average log(wages) of men and women
- A value of -0.397 means that women have a log(wage) that is lower by 0.397 than that of men. Hence, a woman earns on average about 40% less
- As there are no other control variables in the model, these results refer to average persons who differ only in their gender. The difference in the means is statistically highly significant
We can get similar information with a boxplot
Boxplot
boxplot( wage1$lwage ~ wage1$female )
The previous regression shows a statistically significant difference in earned wages between men and women
Does this mean that women are discriminated against?
Not really, because the difference could be due to objective and relevant factors like education, experience, tenure, occupation and so on. Omission of these factors probably leads to an omitted variable bias. As we have data on some of these factors, we can control for them
Hence, we estimate the following, more elaborate model:
log(wage) \ = \ \beta_0 + \beta_1 female + \beta_2 educ + \beta_3 exper + \beta_4 exper^2 + \\ \beta_5 tenure + \beta_6 tenure^2 + u
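The estimation call producing `out1` below is not shown in the text; a sketch consistent with the terms reported in the table would be:
```{r}
# sketch (assumed): wage equation with a gender dummy and controls
out1 <- lm(lwage ~ female + educ + exper + I(exper^2) + tenure + I(tenure^2),
           data = wage1)
```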
Code
modelsummary(list("Wage dependent on gender and many control variables"=out1),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "AI|L|F|B",
align = "ldddddd",
output = "gt")
Wage dependent on gender and many control variables | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 0.417*** | 0.099 | 4.212 | <0.001 | 0.222 | 0.611 |
female | -0.297*** | 0.036 | -8.281 | <0.001 | -0.367 | -0.226 |
educ | 0.080*** | 0.007 | 11.868 | <0.001 | 0.067 | 0.093 |
exper | 0.029*** | 0.005 | 5.916 | <0.001 | 0.020 | 0.039 |
I(exper^2) | -0.001*** | 0.000 | -5.431 | <0.001 | -0.001 | 0.000 |
tenure | 0.032*** | 0.007 | 4.633 | <0.001 | 0.018 | 0.045 |
I(tenure^2) | -0.001* | 0.000 | -2.493 | 0.013 | -0.001 | 0.000 |
Num.Obs. | 526 | |||||
R2 | 0.441 | |||||
R2 Adj. | 0.434 | |||||
RMSE | 0.40 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Interpretation of results
- The intercept is the log(wage) of men (base category) with zero education, zero experience and zero tenure and is about 0.417. Note how large the difference is to the first version without controls (Table 5.5), where the results refer to an average man with some education, experience and tenure
- The coefficient of `female` shows the difference in the log(wages) of men and women
  - A value of -0.297 means that a woman with the same level of education, experience and tenure as a man earns on average about 30% less.
  - But note that this formulation of the model presumes that the effect of gender on wages is equal for all individuals. For a more general formulation, slope effects via interaction variables have to be taken into account; see the next section
- One more year of education leads to a wage that is higher by about 8%, independent of gender
- The effects of experience and tenure are modeled as non-linear, so the partial effects depend on the corresponding levels of the variables and are therefore hard to interpret
  - But as the coefficients of the linear terms are positive and the coefficients of the quadratic ones are negative, we can infer that the effects decline with higher experience and tenure. As we already know, we can use the function `avg_slopes()` of the `marginaleffects` package to calculate the APEs, and the function `plot_predictions()` to plot the non-linear relationships
- Note that the adjusted R-squared is now 0.43 compared to 0.14 in the model without controls. Hence, education, experience and tenure are indeed very important factors of wage earnings
5.4 Interactions with dummy variables
So far, the specification of the model contains one dummy variable for gender. But this can only capture a ceteris paribus difference in the level of wage earnings for women; it does not allow the payoffs (slopes) for education, experience and tenure to depend on gender. However, these slopes could change with gender as well, so that gender effects are not equal for all individuals
If one wants to test for the general effect of gender on wage earnings, these slope effects have to be taken into account. This can be done with interaction terms of all variables with the female dummy
In R, this can be accomplished very easily with the following notation using “*”: lwage \ \sim \ female * ( \ educ + exper + I(exper^2) + tenure + I(tenure^2) \ ) With this formulation, all variables within the brackets are included, plus all interaction terms of every variable (including the intercept) with `female`
After estimation, one can test whether the coefficients of all terms involving `female` (6 in our case) are jointly zero with a usual F-test. If H0 is rejected, gender has a significant effect on wage earnings
The difficulty with this general formulation is that it is not so easy to say how large the wage differential actually is. The reason is that there may be a negative effect in the intercept but different effects on the slopes (payoffs) of education, experience or tenure. Therefore, you once again have to rely on APEs, i.e., you have to calculate an average effect over all individuals with `avg_slopes()`; a sketch of the estimation follows below
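A sketch of the corresponding estimation (the object `out2` used below is assumed to come from a call like this; the formula matches the unrestricted model in the F-test output further down):
```{r}
# sketch (assumed): interact female with all regressors (intercept and slopes)
out2 <- lm(lwage ~ female * (educ + exper + I(exper^2) + tenure + I(tenure^2)),
           data = wage1)
```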
Code
modelsummary(list("Wage dependent on gender and many control variables"=out2),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "AI|L|F|B",
align = "ldddddd",
output = "gt")
Wage dependent on gender and many control variables | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 0.215+ | 0.129 | 1.659 | 0.098 | -0.040 | 0.469 |
female | 0.108 | 0.193 | 0.561 | 0.575 | -0.271 | 0.487 |
educ | 0.087*** | 0.009 | 9.881 | <0.001 | 0.070 | 0.104 |
exper | 0.040*** | 0.007 | 5.708 | <0.001 | 0.027 | 0.054 |
I(exper^2) | -0.001*** | 0.000 | -4.955 | <0.001 | -0.001 | 0.000 |
tenure | 0.033*** | 0.009 | 3.696 | <0.001 | 0.015 | 0.050 |
I(tenure^2) | -0.001* | 0.000 | -1.981 | 0.048 | -0.001 | 0.000 |
female × educ | -0.014 | 0.014 | -1.034 | 0.302 | -0.041 | 0.013 |
female × exper | -0.023* | 0.010 | -2.340 | 0.020 | -0.042 | -0.004 |
female × I(exper^2) | 0.000+ | 0.000 | 1.794 | 0.073 | 0.000 | 0.001 |
female × tenure | 0.007 | 0.015 | 0.451 | 0.652 | -0.022 | 0.036 |
female × I(tenure^2) | -0.001 | 0.001 | -1.443 | 0.150 | -0.002 | 0.000 |
Num.Obs. | 526 | |||||
R2 | 0.461 | |||||
R2 Adj. | 0.450 | |||||
RMSE | 0.39 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Interpretation of the coefficients
- The coefficients (slopes) without `female` are the ones relevant for men, whereas the coefficients with `female` show the difference of the slopes for women compared to men. Therefore, to get the effects for women, you have to add the relevant coefficients. For instance, the marginal effect (first partial derivative) of experience for women is:
\beta_4 + 2 \beta_5 exper + \beta_9 female + 2\beta_{10} female \cdot exper
- Clearly, this marginal effect depends on the values of experience and gender. Use the R function `avg_slopes()` to calculate an average effect
  - An alternative would be to recenter (de-mean) the interaction variables before estimation, as discussed in the section “Interaction terms” above
To test whether gender has an overall effect on wage earnings (intercept and/or slopes) we carry out an F-test. Under the H0, all interaction effects are zero, i.e., the same regression coefficients apply to men and women
With the help of the R function `matchCoefs()` it is easy to select all terms which contain `female` in their name
We test whether all variables which contain `female` are jointly insignificant
As expected, the H0 is clearly rejected:
lht( out2, matchCoefs(out2, "female") )
Linear hypothesis test
Hypothesis:
female = 0
female:educ = 0
female:exper = 0
female:I(exper^2) = 0
female:tenure = 0
female:I(tenure^2) = 0
Model 1: restricted model
Model 2: lwage ~ female * (educ + exper + I(exper^2) + tenure + I(tenure^2))
Res.Df RSS Df Sum of Sq F Pr(>F)
1 520 93.911
2 514 79.920 6 13.991 14.997 7.654e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The APEs for the variables
Code
m2 <- avg_slopes(out2)
modelsummary(list("APEs"=m2),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "",
align = "ldddddd",
output = "gt")
APEs | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
educ | 0.080*** | 0.007 | 11.711 | <0.001 | 0.067 | 0.093 |
exper | 0.010*** | 0.002 | 5.164 | <0.001 | 0.006 | 0.013 |
female | -0.314*** | 0.036 | -8.790 | <0.001 | -0.384 | -0.244 |
tenure | 0.027*** | 0.005 | 5.286 | <0.001 | 0.017 | 0.037 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
The APE for `female` shows the overall wage differential for two average persons which only differ in gender. The effect is about -31%
If you explicitly want the average differences in slope effects for an average man and an average woman, you have to average over men and women separately, which is shown next
Code
# in averaging, we only consider men
m3 <- avg_slopes(out2, newdata = subset(female==0))
modelsummary(list("APEs"=m3),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "",
align = "ldddddd",
output = "gt")
APEs | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
educ | 0.087*** | 0.009 | 9.882 | <0.001 | 0.070 | 0.104 |
exper | 0.014*** | 0.003 | 5.222 | <0.001 | 0.009 | 0.019 |
female | -0.342*** | 0.038 | -9.036 | <0.001 | -0.417 | -0.268 |
tenure | 0.025*** | 0.006 | 4.491 | <0.001 | 0.014 | 0.036 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
In this case, `female` tells us how much an average man would lose if he were a woman. Look also at the slope effects of `educ`, `tenure` and `exper`, which are now the marginal effects for an average man and not for an average person.
We do the same for an average woman, by averaging only over women:
Code
# in averaging, we only consider women
m4 <- avg_slopes(out2, newdata = subset(female==1))
modelsummary(list("APEs"=m4),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "",
align = "ldddddd",
output = "gt"
)
APEs | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
educ | 0.073*** | 0.011 | 6.857 | <0.001 | 0.052 | 0.093 |
exper | 0.005+ | 0.003 | 1.943 | 0.052 | 0.000 | 0.010 |
female | -0.283*** | 0.036 | -7.902 | <0.001 | -0.353 | -0.213 |
tenure | 0.029*** | 0.009 | 3.312 | <0.001 | 0.012 | 0.046 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Now, `female` tells us how much an average woman would gain if she were a man. Why is this different?
Here is a graphical representation of all these estimates of the APEs, subject to gender:
plot_slopes(out2, variables=c("educ", "exper", "tenure", "female"), by="female" ) + theme_bw()
5.5 Treatment effects and the DiD estimator
Another important application of interaction terms is the evaluation of treatment effects of so-called quasi or natural experiments. 2
As an example, suppose we want to estimate the effect of a medical treatment on the outcome of a disease
- To do this we need two groups of patients, treated and untreated, randomly assigned, which we observe for at least two periods – the period before and after treatment
- Measuring the progress of the disease for the treated patients alone is not sufficient, because other effects could be at play which affect treated and untreated alike (e.g., a placebo effect in this example, or general trends)
- To control for these, one strategy is to compare the outcome of the disease for the treated patients with the outcome of the untreated patients over time. By focusing on the difference of changes in y over the course of the experiment, we can control for some pre-treatment differences (e.g., a not really random assignment, which affects only the level of the outcome and not the trend) and for factors which affect treated and untreated equally
So we examine the difference in the clinical status of the treated patients from period one to period two and compare this with the difference in the clinical status of the untreated patients from period one to period two. The difference of these two differences yields the treatment effect we are looking for. Hence the name difference in difference (DiD) estimator
Example – house prices
Let us examine another example with actual data.
In 1981, a garbage incinerator was built in North Andover, Massachusetts
We have a pooled cross-section data set of prices of sold houses for the years 1978 and 1981 and want to examine whether this construction negatively affected the prices of houses near the incinerator (within 3 miles) – this is the “treatment” group
The following R code carries out a DiD estimation for this example:
library(wooldridge)
# Data set "hprice3" of the Wooldridge package
data(hprice3)
# Defining the dummy variable "nearinc" from the variable "dist",
# which is the distance from the incinerator in feet; 15840 = 3 miles
nearinc <- as.numeric(0 + hprice3$dist<15840)
# Control group:
# Difference in the average log house prices "lprices" for houses
# not near the incinerator, from 78 to 81
D1 <- mean( hprice3$lprice[ hprice3$y81 == 1 & nearinc == 0 ] ) -
mean( hprice3$lprice[ hprice3$y81 == 0 & nearinc == 0 ] )
# Treatment group:
# Difference in the average log house prices "lprices" for houses
# near the incinerator, from 78 to 81
D2 <- mean( hprice3$lprice[ hprice3$y81 == 1 & nearinc == 1 ] ) -
mean( hprice3$lprice[ hprice3$y81 == 0 & nearinc == 1 ] )
# Difference of the two changes -- classical DiD estimator
DiD <- D2-D1
Code
```{r}
#| comment: " "
#| code-fold: true
# Building strings for printing
paste( "D1 (difference of houseprices from 78 to 81, not near incin.) = ", round(D1, 3) )
paste( "D2 (difference of houseprices from 78 to 81, near incin.) = ", round(D2, 3) )
paste( "DiD (difference of difference, D2 - D1) = ", round(DiD, 3) )
```
[1] "D1 (difference of houseprices from 78 to 81, not near incin.) = 0.457"
[1] "D2 (difference of houseprices from 78 to 81, near incin.) = 0.394"
[1] "DiD (difference of difference, D2 - D1) = -0.063"
Result of the example above:
The incinerator leads to an estimated decrease in house prices of 6.3% for houses near the incinerator (provided that other factors which affect the control and treatment groups differently play no role – the parallel trends assumption, which is essential for DiD estimations)
How does this work?
- We are interested in the treatment effect on the variable Y, which is defined as (Y_T^T[1] - Y_T^{NT}[1])
  - Thereby, Y_T^T[1] is the value of Y of the treated group (subscript T) at period 1 after treatment (superscript T)
  - Y_T^{NT}[1] is the value of the treated group at period 1, but hypothetically not treated
- Hence, we compare the Y-value of the treated group after treatment (which we can observe) with the Y-value of the treated group under the hypothesis that this group had not been treated. But the latter is counterfactual and thus not observed
- Therefore, we need an estimate of the hypothetical Y_T^{NT}[1]. And here comes the basic assumption of all DiD estimations – the parallel trends assumption:
  - We assume that the change over time of the treated group would have been the same as the change of the untreated group, if the treated group had not been treated
- Thus an estimate of Y_T^{NT}[1] is: \; \widehat {Y_T^{NT}[1]} \ = \ Y_T[0] + (Y_{NT}[1] - Y_{NT}[0]). This is the Y-value of the treated group in period 0 (before all treatments) plus the change over time of the untreated group. Plugging this into our definition of the treatment effect above, we arrive at: ( Y_T^T[1] - \widehat {Y_T^{NT}[1]} ) \ = \ \underbrace {(Y_T^T[1] - Y_T[0])}_{D2} - \underbrace {(Y_{NT}[1] - Y_{NT}[0])}_{D1}
But this is exactly the DiD estimator used in our example!
DiD estimator as interaction term
The same analysis, which was somewhat tedious, can be done in a simpler and more flexible way by a regression with an interaction term 3
lprice = \beta_0 + \beta_1 \, nearinc + \beta_2 \, y81 + \beta_3 \, (y81 \times nearinc) + u
With
- `lprice` equal to log(house prices)
- `nearinc` a dummy which equals one if the house is located near the incinerator (within 3 miles) and zero otherwise (this is the treatment dummy)
- `y81` a dummy variable which equals zero for the year 1978 and one for the year 1981 (this is a time dummy)
- `y81 * nearinc` an interaction between `y81` and `nearinc`
Thereby:
- \beta_0 measures the price of houses not near the incinerator in the year 78; `y81` and `nearinc` are both 0
- \beta_1 measures the general price difference of houses near the incinerator compared to those not near the incinerator, in the year 78; `y81` = 0
- \beta_2 measures the price difference from 78 to 81 for houses not near the incinerator; `nearinc` = 0
- \beta_3 equals the classical DiD estimator: It measures the price difference over time for houses near the incinerator minus the price difference over time for houses not near the incinerator
Why is this the case?
- For the price increase not near the incinerator (control group) we have: (lprice_{81} \mid nearinc = 0) \ - \ (lprice_{78} \mid nearinc = 0) \ = \\ (\beta_0 + \beta_2) \ - \ \beta_0 \ = \ \beta_2
- For the price increase near the incinerator (treatment group) we have: (lprice_{81} \mid nearinc = 1) \ - \ (lprice_{78} \mid nearinc = 1) \ = \\ (\beta_0 + \beta_1 + \beta_2 + \beta_3) \ - \ (\beta_0 + \beta_1) \ = \ \beta_2 + \beta_3
- The DiD is the difference of the two groups’ price changes over time: [(lprice_{81} \mid nearinc = 1) \ - \ (lprice_{78} \mid nearinc = 1)] \ - \ [(lprice_{81} \mid nearinc = 0) \ - \ (lprice_{78} \mid nearinc = 0)] \ = \ (\beta_2 + \beta_3) \ - \ \beta_2 \ = \ \beta_3
Hence, the coefficient of the interaction term, \beta_3, is equal to the classical DiD estimator
On second thought, this result should not be too surprising. We know from Section 5.2 that interaction terms generally give the difference in slopes compared with the reference category 4. In this case, the time dummy `y81` shows the increase over time for the control group, and its interaction with `nearinc` shows the additional time effect (increase or decrease) caused by `nearinc`. And that is the treatment effect we are looking for.
Let’s estimate the model with an interaction term as outlined above:
out <- lm( lprice ~ nearinc + y81 + nearinc:y81, data = hprice3 )
# summary(out)
Code
modelsummary(list("DiD"=out),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "A|L|B|F",
align = "ldddddd",
output = "gt")
DiD | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 11.285*** | 0.031 | 369.839 | <0.001 | 11.225 | 11.345 |
nearinc | -0.340*** | 0.055 | -6.231 | <0.001 | -0.447 | -0.233 |
y81 | 0.457*** | 0.045 | 10.084 | <0.001 | 0.368 | 0.546 |
nearinc × y81 | -0.063 | 0.083 | -0.751 | 0.453 | -0.227 | 0.102 |
Num.Obs. | 321 | |||||
R2 | 0.409 | |||||
RMSE | 0.34 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
The estimated coefficient of the interaction variable is -0.063, which is identical to the classical difference in difference estimator in the example above
- However, we now have estimated standard errors and t-values, enabling statistical tests
- Actually, the estimated effect is not significant!
- Notice the negative sign of `nearinc`, which means that prices of these houses were already 34% lower long before the incinerator was built or even announced. Why?
- We further check: \beta_2 = 0.457 is the price change of the control group (not near the incinerator). This is identical to D1 in the example above
- We check: \beta_2 + \beta_3 = 0.394 is the price change of the treatment group (near the incinerator). This is identical to D2 in the example above
- And D2 - D1 gives -0.063, the DiD estimator of the example above, which is \beta_3
We estimate the same model as above, but this time with additional control variables: `age` of the house, `larea` = log of the house area, `lland` = log of the area of the property, and `linstsq` = squared log of the distance to the interstate.
- These should control for price differences that existed before the incinerator was announced
- More importantly, the characteristics of the houses sold in 81 could have changed relative to 78 in a different way for houses near or not near the incinerator
A sketch of the corresponding estimation follows below.
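The call producing the augmented regression `out1` is not shown; a sketch inferred from the reported coefficients (an assumption, not the author's original code) would be:
```{r}
# sketch (assumed): DiD regression with additional house characteristics as controls;
# "nearinc" is the dummy constructed above from hprice3$dist
out1 <- lm(lprice ~ nearinc * y81 + larea + lland + linstsq + age + agesq,
           data = hprice3)
```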
Code
modelsummary(list("DiD with controls"=out1),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "A|L|B|F",
align = "ldddddd",
output = "gt")
DiD with controls | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 6.690*** | 0.342 | 19.559 | <0.001 | 6.017 | 7.363 |
nearinc | 0.012 | 0.049 | 0.245 | 0.806 | -0.084 | 0.108 |
y81 | 0.418*** | 0.029 | 14.195 | <0.001 | 0.360 | 0.476 |
larea | 0.509*** | 0.041 | 12.482 | <0.001 | 0.429 | 0.590 |
lland | 0.116*** | 0.025 | 4.546 | <0.001 | 0.066 | 0.166 |
linstsq | -0.005* | 0.002 | -2.509 | 0.013 | -0.008 | -0.001 |
age | -0.012*** | 0.001 | -9.292 | <0.001 | -0.014 | -0.009 |
agesq | 0.000*** | 0.000 | 7.180 | <0.001 | 0.000 | 0.000 |
nearinc × y81 | -0.151** | 0.053 | -2.849 | 0.005 | -0.256 | -0.047 |
Num.Obs. | 321 | |||||
R2 | 0.774 | |||||
RMSE | 0.21 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
This now leads to a significant effect of the interaction variable `nearinc:y81` of about -15%.
Furthermore, `nearinc` in 78 (before the incinerator was built) is no longer significant, as it should be
Conclusion: Regressions with interaction terms have several advantages over the classical DiD estimator:
- They enable tests, allow for additional control variables which affect the groups differently over time, and allow for continuous treatment variables instead of dummies
- However, all DiD estimates will be biased and inconsistent if differing trends (or other factors, like self-selection), which are not controlled for, are at work in the control and treatment groups
- This is especially relevant when the treatment of groups starts at different time periods (staggered timing) and the treatment effects are not homogeneous across groups and/or, more importantly, the treatment effects are not constant over time (change the trend or exhibit a dynamic pattern)
For these advanced aspects see Goodman-Bacon (2021) or his very enlightening video https://www.youtube.com/watch?v=m1xSMNTKoMs
- Some possible solutions, notably in a panel setting, are proposed in Callaway and Sant’Anna (2021), Sun and Abraham (2021), Wooldridge (2021), Lee and Wooldridge (2023) and Gardner (2022)
  - Applying standard DiD estimators in a panel setting to data with staggered treatment timings can lead to a particular issue: comparing units treated at one point in time with other groups that were treated in some previous periods. This basically leads to a violation of the parallel trends assumption if treatment causes not only a level effect, but also a trend effect. The solutions proposed in some of the papers mentioned above aim to avoid this specific type of comparison. On the other hand, some papers (especially those by Wooldridge) argue that a misspecification of the estimated model is the core of the problem, which should be tackled with a full set of dummy/interaction variables. Both approaches generally lead to valid estimates.
For a simple R implementation of an event study approach (which allows for dynamic effects but assumes homogeneous treatment effects), look at the video: https://www.youtube.com/watch?v=NmKNQdsLlh0
In R, there are several specialized packages for generalized DiD estimates which allow for heterogeneous dynamic treatment effects and staggered timings; for example, `did` implements the approach of Callaway and Sant’Anna (2021), `did2s` implements Butts and Gardner (2021), `fixest` with the `sunab()` function does Sun and Abraham (2021), and `etwfe` realizes Wooldridge (2021).
Simulation study
5.6 Chow Test
Returning to a previous example, Table 5.7, about the effect of gender on the received wage, it is very interesting to note that the specification with interactions of female with all other variables is identical to separately estimating two models, one for men and one for women. Below, the results of these two independent estimations are shown in Table 5.13 together with the results from Table 5.7.
- These two separate estimates for the subgroups men and women yield the very same parameter estimates as a common estimation with the full set of interaction terms
  - In particular, the results of the men-equation are identical to the results with the full set of interaction variables, and the coefficients of the women-equation are identical to the men-results plus the corresponding interaction terms
The F-test previously carried out regarding all interaction effects below Table 5.7 could also be accomplished with the two separately estimated regressions, using the test statistic from Equation 3.15
F \ = \ \frac{\left(\hat{\mathbf{u}}_{r}^{\prime} \hat{\mathbf{u}}_{r}-\hat{\mathbf{u}}_{ur}' \hat{\mathbf{u}}_{ur} \right) / m}{\hat{\mathbf{u}}'_{ur} \hat{\mathbf{u}}_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1)
- In this setup, the unrestricted SSR is given by the sum of the SSRs of the two separate regressions. Furthermore, the number of restrictions is: m = (k + 1), 6 in our case, because the 6 parameters of the two regressions each have to be equal if there is no difference between the subgroups. The number of freely estimated parameters is 12
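The two separate regressions `out3` (men) and `out4` (women) used in the table and in the Chow test code below are not shown in the text; a sketch would be:
```{r}
# sketch (assumed): separate wage equations for men and women
out3 <- lm(lwage ~ educ + exper + I(exper^2) + tenure + I(tenure^2),
           data = subset(wage1, female == 0))   # men
out4 <- lm(lwage ~ educ + exper + I(exper^2) + tenure + I(tenure^2),
           data = subset(wage1, female == 1))   # women
```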
Code
modelsummary( list("Men "=out3, "Woman"=out4, "Both"=out2),
output="gt",
statistic = "statistic",
gof_omit = "A|B|L|F",
align = "lddd",
stars = TRUE,
fmt = 3)
Men | Woman | Both | |
---|---|---|---|
(Intercept) | 0.215 | 0.323* | 0.215+ |
(1.629) | (2.308) | (1.659) | |
educ | 0.087*** | 0.073*** | 0.087*** |
(9.700) | (7.002) | (9.881) | |
exper | 0.040*** | 0.017* | 0.040*** |
(5.603) | (2.581) | (5.708) | |
I(exper^2) | -0.001*** | 0.000* | -0.001*** |
(-4.864) | (-2.529) | (-4.955) | |
tenure | 0.033*** | 0.039*** | 0.033*** |
(3.628) | (3.349) | (3.696) | |
I(tenure^2) | -0.001+ | -0.001** | -0.001* |
(-1.945) | (-2.801) | (-1.981) | |
female | 0.108 | ||
(0.561) | |||
female × educ | -0.014 | ||
(-1.034) | |||
female × exper | -0.023* | ||
(-2.340) | |||
female × I(exper^2) | 0.000+ | ||
(1.794) | |||
female × tenure | 0.007 | ||
(0.451) | |||
female × I(tenure^2) | -0.001 | ||
(-1.443) | |||
Num.Obs. | 274 | 252 | 526 |
R2 | 0.446 | 0.260 | 0.461 |
RMSE | 0.40 | 0.38 | 0.39 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Carrying out an F-test with the formula from Equation 3.15
```{r}
#| comment: " "
# Running a single regression with the implicit assumption that there are
# no differences in the parameter values between men and women; restricted model
out1 <- lm(log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2),
           data = wage1)
ssr_r <- sum(out1$residuals^2)   # SSR_r of the restricted model
n <- length(out1$residuals)      # number of observations

# SSRs of the two separate, unrestricted equations from out3 and out4
ssr3 <- sum(out3$residuals^2); ssr4 <- sum(out4$residuals^2)

# SSR_ur is the sum of the SSRs of the two separate, unrestricted equations
ssr_ur <- ssr3 + ssr4

# F test statistic; m = k+1 = 6; in total, 2*6 parameters are freely estimated
F <- ( (ssr_r - ssr_ur)/6 ) / ( ssr_ur/(n - 2*6) )

# prob-value of the F statistic: plug the values for F, df1 and df2 into
# the upper tail of the F-distribution
pval <- pf(F, 6, n - 2*6, lower.tail = FALSE)

"Result of F-test, whether all coefficients for men and women are equal"
paste("F-statistic =", round(F, 3), " p-value =", signif(pval, 4))
```
[1] "Result of F-test, whether all coefficients for men and women are equal"
[1] "F-statistic = 14.997  p-value = 7.654e-16"
As expected, the F-test statistic is identical to the one where we tested for the exclusion of all variables containing female, see below Table 5.7
This version of the test is called the Chow test. The Chow test is a general test to verify whether subgroups of the sample (e.g., men/women, married/unmarried or other characteristics) follow the same population relationship – or the same data generating process
Originally, the Chow test was intended for the detection of structural breaks in time series relationships, e.g., Phillips curves before and after Brexit. With time series, by carrying out multiple Chow tests for different periods one can try to detect the point in time of a structural break. This is called a Chow breakpoint test
- But as we have seen, it is not necessary to estimate separate equations for every subgroup of the sample to carry out this test
  - Rather, it is sufficient to include the appropriate dummy variable, `female` in our example (or time after Brexit), and the full set of interaction terms with that dummy variable to allow for changes in the intercept as well as in all slopes
5.7 Categorical data in R
In R, factors are an extension of dummy variables designed for storing categorical information, especially if there are more than two categories
Internally, a factor variable in R consists of a vector of integers with values 1 to k, and a character vector, called `levels`, which is stored as an attribute of the variable and contains the strings for the labels of the k categories
- Using factors in R has several advantages
  - When encountering a factor, R knows that this variable contains categorical information and can choose appropriate methods automatically, for instance for plotting and printing
  - Furthermore, many packages support factor variables, e.g., when calculating APEs
- If we have k categories, for instance for different occupations, these k categories are stored in a single factor variable. Using this factor in a regression, R automatically expands it into k-1 dummy variables
- To define a dummy variable or a character vector which contains the categories as a factor variable, simply use the function `factor(varname)`
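A small illustration of these internals, using a made-up character vector (not one of the data sets used below):
```{r}
# hypothetical example: a character vector converted to a factor
occ <- factor(c("worker", "office", "worker", "management"))
levels(occ)       # the stored category labels (the "levels" attribute)
as.integer(occ)   # the underlying integer codes 1..k
```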
Example data set
library(AER)
library(wooldridge)
data("CPS1985")
str(CPS1985)
'data.frame': 534 obs. of 11 variables:
$ wage : num 5.1 4.95 6.67 4 7.5 ...
$ education : num 8 9 12 12 12 13 10 12 16 12 ...
$ experience: num 21 42 1 4 17 9 27 9 11 9 ...
$ age : num 35 57 19 22 35 28 43 27 33 27 ...
$ ethnicity : Factor w/ 3 levels "cauc","hispanic",..: 2 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 2 levels "south","other": 2 2 2 2 2 2 1 2 2 2 ...
$ gender : Factor w/ 2 levels "male","female": 2 2 1 1 1 1 1 1 1 1 ...
$ occupation: Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sector : Factor w/ 3 levels "manufacturing",..: 1 1 1 3 3 3 3 3 1 3 ...
$ union : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
$ married : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 2 1 ...
Attributes of a factor variable
```{r}
#| comment: " "
attributes(CPS1985$gender)
levels(CPS1985$occupation)
```
$levels
[1] "male" "female"
$class
[1] "factor"
[1] "worker" "technical" "services" "office" "sales"
[6] "management"
Factor variables in regressions
out <- lm( log(wage) ~ education + age + gender + married + ethnicity + occupation,
data = CPS1985 )
modelsummary(list("Regression with factor variables"=out),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
gof_omit = "A|L|B|F",
align = "ldddddd",
output = "gt")
Regression with factor variables | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 0.926*** | 0.144 | 6.415 | <0.001 | 0.642 | 1.209 |
education | 0.062*** | 0.010 | 6.429 | <0.001 | 0.043 | 0.081 |
age | 0.011*** | 0.002 | 6.267 | <0.001 | 0.007 | 0.014 |
genderfemale | -0.228*** | 0.042 | -5.395 | <0.001 | -0.311 | -0.145 |
marriedyes | 0.074+ | 0.042 | 1.768 | 0.078 | -0.008 | 0.156 |
ethnicityhispanic | -0.121 | 0.089 | -1.362 | 0.174 | -0.294 | 0.053 |
ethnicityother | -0.066 | 0.058 | -1.132 | 0.258 | -0.180 | 0.048 |
occupationtechnical | 0.146* | 0.071 | 2.059 | 0.040 | 0.007 | 0.285 |
occupationservices | -0.192** | 0.062 | -3.069 | 0.002 | -0.314 | -0.069 |
occupationoffice | -0.043 | 0.064 | -0.664 | 0.507 | -0.169 | 0.083 |
occupationsales | -0.215** | 0.082 | -2.611 | 0.009 | -0.376 | -0.053 |
occupationmanagement | 0.159* | 0.076 | 2.085 | 0.038 | 0.009 | 0.309 |
Num.Obs. | 534 | |||||
R2 | 0.326 | |||||
RMSE | 0.43 | |||||
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Interpretation and redefinitions
In interpreting the coefficients, note that they are relative to the base category, which is the one left out (by default the first category)
- For instance, \hat\beta_9, the coefficient of `occupationservices` = -0.192, means that, everything else equal, a person working in services earns about 19.2% less than a person working as a worker (the base category)
- If you want to change the base category, you can do this with the function `relevel()`. For instance, `relevel(CPS1985$occupation, "management")` makes management the base category of occupation
- With `levels()` you can define, name or rename the categories. For instance: `levels(varname) <- c("low", "high")`
- It is also possible to recast a continuous variable into different categories, for instance an income variable into low, middle and high income (see the sketch below):
  - `new_name <- cut(old_name, 3)` (makes three categories; instead of a number you can also provide a vector with cutpoints)
  - `levels(new_name) <- c("low", "mid", "high")`
5.8 The linear probability model (LPM)
Until now, we only analysed models with dummy variables on the right-hand side – as explanatory variables.
Now, we consider a linear regression with a binary dependent variable, i.e., a dummy variable as the left-hand variable.
- For instance, y may represent labor force participation of married women, which is one if the woman is in the labor force and zero if not
y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u
Under MLR.4, we have as usual:
\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta
As the dependent variable only takes on the values 0 and 1 we have:
E(y \mid \mathbf x) \, = \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x) \, = \, P(y=1 \mid \mathbf x)
This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.
- The coefficient \beta_j describes the partial effect of the explanatory variable j on the probability that y=1 \beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}
For given \mathbf x, the error term u_i of the LPM can only take two values: u_i = \begin{cases} 1 - P(y=1 \mid \mathbf x), \ \text{ if } y_i = 1 \\ \ \ - P(y=1 \mid \mathbf x), \ \text{ if } y_i = 0 \end{cases} Whereby: P(y=1 \mid \mathbf x) \ = \ \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \, = \ \mathbf x \boldsymbol \beta
Hence, for given \mathbf x, u_i as well as y_i are Bernoulli distributed random variables
Although both possible values of u_i depend on \mathbf x, the expected value of u_i, conditional on \mathbf x, surprisingly is zero. Hence, MLR.4 is automatically fulfilled, leading to unbiased and consistent estimates of \beta_j (assuming the model is correctly specified, which is generally questionable for the LPM):
E(u_i \mid \mathbf x) \ = \ \left (1 - P(y=1 \mid \mathbf x) \right) \cdot P(y=1 \mid \mathbf x) \ + \ \left ( - P(y=1 \mid \mathbf x) \right) \cdot \underbrace{\left(1 - P(y=1 \mid \mathbf x)\right)}_{P(y=0 \mid \mathbf x)} \ = \ P(y=1 \mid \mathbf x) \ - \ P(y=1 \mid \mathbf x)^2 \ - \ P(y=1 \mid \mathbf x) \ + \ P(y=1 \mid \mathbf x)^2 \ = \ 0
- However, as we show below, the variance of u_i depends on \mathbf x, hence, we have heteroscedastic errors
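A one-line derivation of this heteroscedasticity (writing p(\mathbf x) = P(y=1 \mid \mathbf x) for brevity and using E(u_i \mid \mathbf x) = 0 from above):
\operatorname {Var}(u_i \mid \mathbf x) \ = \ E(u_i^2 \mid \mathbf x) \ = \ \left(1 - p(\mathbf x)\right)^2 p(\mathbf x) \ + \ p(\mathbf x)^2 \left(1 - p(\mathbf x)\right) \ = \ p(\mathbf x)\left(1 - p(\mathbf x)\right)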
Example – labor force participation of married women
'data.frame': 753 obs. of 22 variables:
$ inlf : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
$ kidslt6 : int 1 0 1 0 1 0 0 0 0 0 ...
$ kidsge6 : int 0 2 3 3 2 0 2 0 2 2 ...
$ age : int 32 30 35 34 31 54 37 54 48 39 ...
$ educ : int 12 12 12 12 14 12 16 12 12 12 ...
$ wage : num 3.35 1.39 4.55 1.1 4.59 ...
$ repwage : num 2.65 2.65 4.04 3.25 3.6 ...
$ hushrs : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
$ husage : int 34 30 40 53 32 57 37 53 52 43 ...
$ huseduc : int 12 9 12 10 12 11 12 8 4 12 ...
$ huswage : num 4.03 8.44 3.58 3.54 10 ...
$ faminc : num 16310 21800 21040 7300 27300 ...
$ mtr : num 0.721 0.661 0.692 0.781 0.622 ...
$ motheduc: int 12 7 12 7 12 14 14 3 7 7 ...
$ fatheduc: int 7 7 7 7 14 7 7 3 7 7 ...
$ unem : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
$ city : int 0 1 0 0 1 1 0 0 0 0 ...
$ exper : int 14 5 15 6 7 33 11 35 24 21 ...
$ nwifeinc: num 10.9 19.5 12 6.8 20.1 ...
$ lwage : num 1.2102 0.3285 1.5141 0.0921 1.5243 ...
$ expersq : int 196 25 225 36 49 1089 121 1225 576 441 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
Estimation
We estimate a model to explain the probability of labor force participation of married women, `inlf` (for “in labor force”).
The explanatory variables are: `nwifeinc` (additional family income), `educ`, `exper`, `exper^2`, `age`, `kidslt6` (number of children under 6 years) and `kidsge6` (number of children over 6 years)
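The estimation itself is not shown above; a sketch consistent with the output below (the object name `outm` is the one used in the subsequent code) is:
```{r}
# sketch (assumed): linear probability model for labor force participation
data("mroz")
outm <- lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
           data = mroz)
```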
# We calculate squared residuals and fitted values for plotting
ressq <- outm$residuals^2
yhat <- outm$fitted.values
# Note, we have heteroskedasticity robust standard errors
modelsummary(list("y = P(y=1 | x)"=outm),
shape = term ~ statistic,
statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
stars = TRUE,
align = "ldddddd",
vcov = c(vcovHC),
output = "gt")
y = P(y=1 | x) | ||||||
---|---|---|---|---|---|---|
Est. | S.E. | t | p | 2.5 % | 97.5 % | |
(Intercept) | 0.586*** | 0.154 | 3.812 | <0.001 | 0.284 | 0.887 |
nwifeinc | -0.003* | 0.002 | -2.185 | 0.029 | -0.006 | 0.000 |
educ | 0.038*** | 0.007 | 5.177 | <0.001 | 0.024 | 0.052 |
exper | 0.039*** | 0.006 | 6.600 | <0.001 | 0.028 | 0.051 |
I(exper^2) | -0.001** | 0.000 | -2.997 | 0.003 | -0.001 | 0.000 |
age | -0.016*** | 0.002 | -6.664 | <0.001 | -0.021 | -0.011 |
kidslt6 | -0.262*** | 0.032 | -8.143 | <0.001 | -0.325 | -0.199 |
kidsge6 | 0.013 | 0.014 | 0.953 | 0.341 | -0.014 | 0.040 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
- The estimated coefficients all have the expected sign. Of special interest is the variable `kidslt6`, the number of children under 6 years
  - The estimated coefficient of -0.26 means that an additional young child reduces the probability of a married woman being in the labor force by 26 percentage points. This is a huge effect! Children over six do not seem to have a notable impact
# Plotting squared residuals against an explanatory variable to
# detect heteroscedasticity, which is clearly present
plot(mroz$kidslt6,ressq)
# Residuals versus predicted values; for given x,
# only two values for the residuals are possible
plot(outm$residuals ~ yhat )
abline(0,0)
Summary
- Advantages of the linear probability model
  - Easy estimation and interpretation; everything is quite straightforward, which is not the case for the alternative and more elaborate probit or logit models discussed in a later section
  - Estimated effects and predictions are often reasonably good in practice
- Disadvantages of the linear probability model
  - Predicted probabilities may be larger than one or smaller than zero, which is impossible for a probability! In practice, this is often not a big issue, as typically most of the predictions lie in the interval [0, 1] (a quick numerical check follows this list)
  - The partial effects are constant, so large enough changes in x_j can lead to changes in the predicted probability larger than one, which is impossible (for instance, four children under six years: with the estimate above, 4 × 0.262 ≈ 1.05). Hence, strictly speaking, the model with constant partial effects cannot be correctly specified. However, very large variations in the explanatory variables x_j do not occur very often in most samples
  - Both problems are tackled with the probit or logit models mentioned above
- As shown above, the linear probability model is necessarily heteroscedastic
  - Hence, heteroscedasticity-consistent standard errors need to be computed
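As a quick illustration of the first disadvantage, one can check how many of the fitted values computed above (the object yhat) fall outside the unit interval (a minimal sketch):
# Count and share of LPM predictions outside [0, 1]
sum(yhat < 0 | yhat > 1)    # number of out-of-range fitted probabilities
mean(yhat < 0 | yhat > 1)   # share of out-of-range fitted probabilities
range(yhat)                 # smallest and largest fitted probability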
5.9 Predictions
Having estimated an equation, you can use the estimated relationship to make predictions. However, these predictions are made with estimated parameters and are therefore subject to sampling errors, even if all usual assumptions are met. Hence, it is useful to calculate confidence intervals for predictions, based on the estimated regression line.
The predicted value of a regression model for specific values of \mathbf x_{f} (which typically are not in the sample) is:
\hat E(y_f \mid \mathbf x_f) \ = \ \hat y_f \ = \ \hat \beta_0 + \hat\beta_1 x_{f,1} + \ldots + \hat\beta_k x_{f,k}
- Thereby, we simply plug in for \mathbf x_{f} in the estimated regression relationship
The conditional variance of the prediction for a two-variable model with de-meaned variables (so, we need no intercept) is
\operatorname {Var}(\hat y_f \, \mid \, \mathbf X, \mathbf x_f ) \ =
\operatorname {Var}(\hat\beta_1 \, \mid \, \mathbf X, \mathbf x_f) \, x_{f,1}^2 + \operatorname {Var}(\hat\beta_2 \, \mid \, \mathbf X, \mathbf x_f) \, x_{f,2}^2 +
2 \operatorname {Cov}(\hat\beta_1, \hat\beta_2 \, \mid \, \mathbf X, \mathbf x_f ) x_{f,1} x_{f,2}
The variance of the prediction is solely due to the uncertainty regarding the estimated parameters. Hence, this is a weighted sum of the variances and covariance(s) of the estimated coefficients
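Equivalently, in matrix form this is the quadratic form \mathbf x_f' \, \widehat{\operatorname{Var}}(\hat{\boldsymbol \beta} \mid \mathbf X) \, \mathbf x_f. A minimal R sketch, using a simplified illustrative specification for the attend data (not the exact model estimated earlier):
# Simplified illustrative model (not the specification from Section 5.2)
fit <- lm(stndfnl ~ atndrte + priGPA + ACT, data = attend)
# Point x_f at which we predict: an 80% attendance rate, other variables at their means;
# the leading 1 corresponds to the intercept
xf <- c(1, 80, mean(attend$priGPA), mean(attend$ACT))
# Variance of the prediction: quadratic form in the coefficient covariance matrix
var_yhat_f <- t(xf) %*% vcov(fit) %*% xf
sqrt(var_yhat_f)   # standard error of the prediction
# predict() with se.fit = TRUE returns the same standard error
newd <- data.frame(atndrte = 80, priGPA = mean(attend$priGPA), ACT = mean(attend$ACT))
predict(fit, newd, se.fit = TRUE)$se.fit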
If we have estimated the variance of the prediction we can construct a 95% confidence interval in the sense that for repeated samples the true predictor E(y_f \mid \mathbf x_f) is within this confidence interval in 95% of the cases. This approximately is: 5
E(y_f\mid \mathbf x_f) \ \ \underset {95\%} \in \ \ \left[ \hat y_f \ \pm 2 \ \sqrt{ \widehat{\operatorname {Var}(\hat y_f \mid \mathbf X, \mathbf x_f )} } \right] \tag{5.6}
Note, this confidence interval is solely with regard to variations of \hat {\boldsymbol \beta} in repeated samples, i.e., to the uncertainty of the parameter estimates
Furthermore, if \hat \beta_j is conditionally normally distributed, then the same is true for \hat y_f
If we want to know how accurate the prediction is, we need to know the variance of the prediction error for y_f
The prediction error e_f is defined as
e_f \ := \ (y_f - \hat y_f) \ =
\underbrace {\beta_0 + \beta_1 x_{f,1} + \ldots + \beta_k x_{f,k} + u_f}_{y_f} \ - \ \underbrace {(\hat \beta_0 + \hat\beta_1 x_{f,1} + \ldots + \hat\beta_k x_{f,k})}_{\hat y_f} \ =
\underbrace{ (\beta_0 - \hat\beta_0) + (\beta_1 - \hat \beta_1) x_{f,1} + \cdots + (\beta_k - \hat \beta_k) x_{f,k} }_{\text{prediction error due to errors in parameter estimates}} \ + \ \underbrace {u_f}_{\text{due to } u}
- Hence, the variance of the prediction error is the sum of the two independent components above:
\operatorname {Var}(e_f \mid \mathbf X, \mathbf x_f ) \, = \, \operatorname {Var}(\hat y_f \mid \mathbf X, \mathbf x_f ) + \sigma^2
The first part is the variance of prediction from above and the second part is the variance of the regression error term
- Regarding the independence of these two parts, which is necessary for the derivation of the formula above, we have to discuss a subtle point: u_f is not used to estimate \hat\beta_j and therefore the two are statistically independent
- The reason is that estimation is carried out with a particular sample and a particular realization of u. The prediction is carried out later, for an arbitrary x_f, and the prediction error realizes with another draw, u_f, from the distribution of u. Hence, \hat\beta_j and u_f are independent, if u is not autocorrelated (MLR.5)
- We now can construct a 95% confidence interval in the sense that for repeated samples the actual y_f is within this confidence interval in 95% of the cases. This approximately is:
y_f \ \ \underset {95\%} \in \ \ \left[ \hat y_f \ \pm \ 2 \, \sqrt{ \widehat{\operatorname {Var}(\hat y_f \, \mid \, \mathbf X, \mathbf x_f )} + \hat \sigma^2 } \right] \tag{5.7}
This second confidence interval is with regard to the actual realizations of y_f and tells us something about the distribution of y-values
The former confidence interval is with regard to the expected value of y_f and tells us something about the uncertainty in the mean. The two differ with regard to the variance of the error term
- In R, the following naming is used:
  - The former interval, the narrower one (regarding the expected value E(y_f \mid \mathbf x_f)), is denoted by confidence (meaning the confidence in the mean)
  - The latter interval, the wider one (regarding the realizations of y_f), is denoted by prediction (meaning the interval for the prediction of y_f)
Final remark: If x_f is in the sample, the prediction error e_f is simply the residual \hat u_f
Example
We, once more, use the model which predicts the test score of students depending on the attendance rate of a microeconomics lecture and on prior test results, priGPA and ACT. This model was stored in the list myres.
Now, we want to predict the outcome for three artificial students: the first two are average students, but the first does not attend the micro lecture whereas the second one attends the lecture. The third has a very good test record and attends the lecture. In particular, we have the following data x_f:
- Student A: atndrte = 0, priGPA = mean(priGPA), ACT = mean(ACT)
- Student B: atndrte = 100, priGPA = mean(priGPA), ACT = mean(ACT)
- Student C: atndrte = 100, priGPA = 4, ACT = 30
We have to store these values in a new data frame and call the function predict().
We request both confidence intervals with the option interval = "confidence" or interval = "prediction", respectively.
# Specify values of exogenous variables for which we want a prediction
# the data have to be stored in a data.frame
# Calculating means
mean_priGPA = mean(attend$priGPA)
mean_ACT = mean(attend$ACT)
xf <- data.frame( atndrte = c(0,100,100) , priGPA = c(mean_priGPA, mean_priGPA, 4), ACT = c(mean_ACT, mean_ACT, 30 ) )
# Prediction and confidence interval for expected values
predict(myres, xf, interval = "confidence")
fit lwr upr
1 -0.767446712 -1.2091453 -0.3257481
2 0.006209056 -0.1222101 0.1346282
3 2.072518373 1.7088323 2.4362044
# Prediction and confidence interval for actual values
predict(myres, xf, interval = "prediction")
fit lwr upr
1 -0.767446712 -2.5373289 1.002435
2 0.006209056 -1.7124755 1.724894
3 2.072518373 0.3204759 3.824561
# With `plot_predictions` from the package `marginaleffects` we can draw the
# confidence interval for expected values (representing parameter uncertainty)
# in dependence of a particular variable and the other variables are at their
# mean values
plot_predictions(myres, condition = "atndrte", points=0.2, rug=TRUE) + theme_bw()
Predictions with log(y)
If the dependent variable is in logs, the predicted value of y_i is not simply the exponentiated prediction of \log y_i:
\hat y \ \neq \ \exp ({\widehat {\log y_i}}) \ = \ \exp( \underbrace {\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k}_{\widehat {\log y_i}} )
- In particular we have
\log y \ = \ \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \ \ \Rightarrow
y \ = \ \exp (\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) \cdot \exp (u) \ \ \Rightarrow
E(y \mid \mathbf x ) \ = \ \exp (\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) \cdot E( \exp (u) \mid \mathbf x )
- If u_i is normally distributed and independent of \mathbf x, then E( \exp (u) \mid \mathbf x ) = \exp (\sigma^2 / 2 ) and a prediction for y_i is
\hat y_i \ = \ \exp({\widehat{\log y_i}}) \exp(\hat\sigma^2 / 2)
Here, \hat\sigma^2 is our standard estimator for \sigma^2
- This estimator for y_i is biased but consistent (it is biased, because \sigma^2 is replaced by \hat \sigma^2 inside a non-linear function, compare Wooldridge (2019), p. 206)
- Note that \exp(\hat\sigma^2 / 2) is always > 1, thus \exp ({\widehat {\log y_i}}) would typically underestimate y_i
If the assumption of normally distributed errors is not fulfilled, as is often the case, a MoM (method of moments) estimator can be applied: replace E( \exp (u) \mid \mathbf x ) with the corresponding sample average of the exponentiated residuals, \frac{1}{n}\sum_{i=1}^n \exp(\hat u_i), to get a consistent estimate of \hat y_i; a sketch of both corrections follows below
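A minimal sketch of the two correction factors, using a hypothetical log-wage regression on the wage1 data from the wooldridge package (an added illustration, not an example from the text):
# Hypothetical log-wage regression for illustration
data("wage1")
lfit <- lm(log(wage) ~ educ + exper + tenure, data = wage1)
logpred <- predict(lfit)                          # fitted values of log(wage)
# Correction factor assuming normally distributed errors
yhat_normal <- exp(logpred) * exp(summary(lfit)$sigma^2 / 2)
# MoM correction: sample average of the exponentiated residuals
alpha_mom <- mean(exp(residuals(lfit)))
yhat_mom  <- exp(logpred) * alpha_mom
# Compare the naive and the two corrected predictions for the first observations
head(cbind(naive = exp(logpred), normal = yhat_normal, mom = yhat_mom))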
Matrix notation
Studentized residuals
In this case, the APE is equal to the Partial Effect at Average (PEA), but this is not always the case.↩︎
See Card and Krueger (1994) for a famous and controversial investigation of the effects of minimum wages on employment using a DiD estimator. They found a positive effect of rising minimum wages on employment with their DiD estimators, which is not easy to explain and suggests that some bias may be involved. For a critical comment on this work see Neumark and Wascher (2000). For a more modern approach with a different result see Callaway and Sant’Anna (2021).↩︎
In a panel setting (which is not the case here – we have a pooled cross-section) with units i and two periods we simply could regress:
\Delta lprice_i = \beta_0 + \beta_1 nearinc_i + u_i, or
lprice_{it} = \beta_1 nearinc_{it} + a_i + d_{t} + u_{it},
with nearinc_{it} being 0 before the incinerator was built and 1 after that event for those units nearby, a_i being unit-specific intercepts and d_t time-specific intercepts (both resulting from the inclusion of corresponding dummy variables, see Chapter 9).
Actually, if the same units are observed in the two (or more) periods, these specifications are most often used in applied work.↩︎
As usual, both variables have to be included in the equation for a correct interpretation.↩︎
With the true predictor E(y_f \mid \mathbf x_f) we mean the corresponding point on the population regression function, Figure 1.2.↩︎