library(AER); library(effects); library(wooldridge)
data(mroz)
Binary variables, e.g., employed/not employed
Non-negative variables, e.g., wages, prices, interest rates
Count variables, e.g., the number of arrests in a year
Censored variables, e.g., unemployment duration
We further have
The sample used to infer population relationships is endogenously selected, e.g., a wage offer regression but wage data are only observable if people are actually working
Until now, we only analysed models with dummy variables on the right hand side – as explanatory variables.
Now, we analyze models with a binary dependent variable, i.e., a dummy variable as left hand side variable.
y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u
Under MLR.4, we have as usual (population regression line):
\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta
As the dependent variable only takes on the values 0 and 1 we additionally have in this case:
E(y \mid \mathbf x) := \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x) \, = \, P(y=1 \mid \mathbf x)
This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.
\beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}
library(AER); library(effects); library(wooldridge)
data(mroz)
# LPM for labor force participation of married women
outm <- lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

# We calculate squared residuals and fitted values for plotting
ressq <- outm$residuals^2
yhat  <- outm$fitted.values
A popular measure of fit for such models is the hit rate, which measures the portion of correctly predicted observations of the dependent variable (which has only zeros or ones)
Besides the overall hit rate, one can separately examine, how the hit rates are for values of y_i=1 and y_i=0
The portion of correctly predicted y_is, when y_i=1, is called sensitivity. This is the ability of the model to detect true “positives” (e.g., the percentage of people predicted to be in the workforce who actually are or, as an example from medicine, the percentage of sick people who are correctly identified as sick)
The portion of correctly predicted y_is, when y_i=0, is called specificity. This is the ability of the model to detect true “negatives” (e.g., the percentage of people predicted not to be in the workforce who actually are not or, as an example from medicine, the percentage of healthy people who are correctly identified as healthy)
Below, we calculate the hit rates described above for the LPM and, for purposes of comparison, for a corresponding Probit model (discussed below). The results are nearly identical. (The code for this hit-rate function is listed below and could be used by any user)
hitrate <- function(outm, cutoff = 0.5, stragaz = TRUE) {
  # This function calculates overall hit rates, sensitivity and specificity for
  # LPM, Logit and Probit models
  # outm = output list of `lm()` or `glm()`
  # cutoff = threshold value for predicting 1, default = 0.5
  # requires package stargazer

  # load dependent variable of model
  y <- outm$model[[1]]
  # load fitted values
  yhat <- outm$fitted.values
  ly <- length(y)

  # prediction of one, if yhat > threshold
  ypred <- (yhat > cutoff)*1

  # y which are 1
  y_1 <- y[y==1]
  ly_1 <- length(y_1)
  # y which are zero
  y_0 <- y[y==0]
  ly_0 <- length(y_0)

  # prediction error
  prederror <- y - ypred

  # overall hits
  hit_o <- rep(0, ly)
  hit_o <- ifelse( (ypred == 1 & y == 1) | (ypred == 0 & y == 0), 1, 0)
  hitrate_o <- sum(hit_o)/ly

  # correct prediction if y=1
  hit_1 <- rep(0, ly)
  hit_1 <- ifelse((ypred == 1 & y == 1), 1, 0)
  hitrate_1 <- sum(hit_1)/ly_1

  # correct prediction if y=0
  hit_0 <- rep(0, ly)
  hit_0 <- ifelse((ypred == 0 & y == 0), 1, 0)
  hitrate_0 <- sum(hit_0)/ly_0

  # storing results in data frame
  results <- data.frame(HitRate = hitrate_o, Sensitivity = hitrate_1, Specificity = hitrate_0,
                        Nof1s = ly_1, Nof0s = ly_0, check.rows = TRUE)
  if (stragaz == TRUE) {
    stargazer::stargazer(results, type = "text", summary = FALSE, rownames = FALSE)
  }
  return(results)
}

# Hit rate of LPM
hr_lpm <- hitrate(outm)
===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734 0.818 0.625 428 325
-------------------------------------------
# For comparison, estimate a corresponding Probit model
# which we introduce below
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                 data = mroz, family = binomial(link = "probit"))

# Hit rate of Probit model
hr_probit <- hitrate(outprobit)
===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734 0.813 0.631 428 325
-------------------------------------------
The main problems of the Linear Probability Model (due to the linearity in \mathbf x) are that predicted probabilities can lie outside the unit interval [0,1] and that the partial effects are constant, irrespective of the level of \mathbf x
An obvious solution for these problems is a nonlinear model for binary response
E(y|\mathbf x) \, =\, P(y=1 | \mathbf{x}) \, = \, G\left(\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}\right) \, = \, G(\mathbf{x} \boldsymbol{\beta}) \
G(\cdot) is the (inverse) link function which maps the index \mathbf x \boldsymbol \beta to the probability that y_i = 1
Obvious candidates for the function G(\cdot) are cumulative distribution functions (cdf), which are increasing and map (-\infty, +\infty) into the unit interval [0,1]
The estimation model for the observable variable y therefore is
y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u
The error term u of this model has the same properties as the one in the LPM, especially the errors are heteroscedastic with \operatorname{Var}(u_i | \mathbf x) = G(\mathbf x \boldsymbol \beta) (1 - G(\mathbf x \boldsymbol \beta)), (see Section 5.8)
Conditional on \mathbf x, the observations y_i are Bernoulli distributed
Regarding the choice of the link function there are two main models
G(z) \ = \ \Phi(z) := \, \int_{-\infty}^z \phi(v)dv \ = \, \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ \ \quad \text{ :cdf of standard normal distribution}
G(z) \ = \ \Lambda(z) := \, \dfrac {e^z}{1 + e^z} \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \ \ \ \quad \text{ :cdf of standard Logistic distribution}
Both distributions are symmetric and have quite similar shapes. However, the Logistic distribution has fatter tails – it is nearly identical to a t(8)-distribution. Outliers are therefore somewhat more probable, which renders the Logit estimates a little more robust against outliers
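As a quick visual check, here is a minimal base-R sketch (not part of the original code) that overlays the two link functions; the unscaled logistic cdf is visibly flatter in the tails:
# Sketch: comparing the standard normal and the (unscaled) logistic cdf
z <- seq(-4, 4, length.out = 200)
plot(z, pnorm(z), type = "l", ylab = "G(z)", main = "Probit vs. Logit link")
lines(z, plogis(z), lty = 2)
legend("topleft", legend = c("standard normal cdf", "logistic cdf"), lty = c(1, 2))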
y^* = \mathbf x \boldsymbol \beta + e, \quad \text{ and } \ \ y_i = \begin{cases} 1 \ \text{ if } \ y_i^*>0 \\ 0 \ \text{ if } \ y_i^*\leq 0 \end{cases}
e \sim N(0,1) with the Probit model; and e \sim Logistic(0,\pi^2/3) with the Logit model
If the latent variable y^* is larger than zero, the observable y takes on the value 1, if it is less or equal zero, y takes on 0 (y^* can thus be interpreted as the propensity to have y = 1)
Note that there is rarely a well defined unit of measurement for y^*. Therefore, identification requires a scaling, which is done by setting the variance of e to one, respectively to \pi^2/3
We do not observe the utility of the action – y^* is a latent variable – we only observe whether the action is taken or not, y=1 or y=0, i.e., whether y^* > 0 or y^* \leq 0. In the following it is shown that this model of discrete choice leads to our estimation model
E(y \,| \, \mathbf x) \ = \ P(y = 1\,| \,\mathbf x) \ = \ P(y^* > 0 \,| \,\mathbf x) \ = \ P(\mathbf x \boldsymbol \beta + e > 0\,| \,\mathbf x) \ = \ P(e > -\mathbf x \boldsymbol \beta \,| \,\mathbf x) \ \ =
1 - G(-\mathbf x \boldsymbol \beta) \ \ = \text { (by symmetry) } \
G(\mathbf x \boldsymbol \beta) \ = \begin{cases} \Phi(\mathbf x \boldsymbol \beta) \ \text{ :Probit model } \\ \Lambda(\mathbf x \boldsymbol \beta) \ \text{ :Logit model } \end{cases}
E(y \,| \, \mathbf x) \, = \, G(\mathbf x \boldsymbol \beta)
y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u
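To make the latent-variable interpretation concrete, the following small simulation (a sketch with made-up numbers, not from the original notes) generates y from the latent model with e \sim N(0,1) and checks that a Probit regression approximately recovers the coefficients:
# Sketch: simulate the latent-variable model and re-estimate it with a Probit
set.seed(123)
nsim  <- 5000
x1    <- rnorm(nsim)
ystar <- -0.5 + 1.0 * x1 + rnorm(nsim)   # latent propensity, e ~ N(0,1)
ysim  <- as.numeric(ystar > 0)           # we only observe whether y* > 0
coef(glm(ysim ~ x1, family = binomial(link = "probit")))  # close to (-0.5, 1.0)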
Sometimes, a non-symmetric distribution seems more appropriate, for instance if the numbers of ones (successes) and zeros (failures) are very different in the sample. In this case the Gumbel distribution (an extreme value distribution) can be used as an interesting alternative link function
G(z) \ =\, {e^{-e^{-z}}} \ = {e^{-e^{-\mathbf x \boldsymbol \beta}}} \ \ \ \ \quad \text{ :cdf of standard Gumbel distribution}
This is a right-skewed distribution, which might be appropriate if the ones are the rare (extreme) outcomes. If the zeros are the rare outcomes one could use a left-skewed distribution, the (mirrored) Complementary Gumbel distribution:
1-G(-z) \ =\, 1-e^{-e^{z}} \ = 1-e^{-e^{\mathbf x \boldsymbol \beta}} \ \ \ \ \quad \text{ :cdf of Complementary Gumbel distribution}
Interestingly, only this version is implemented in the R procedure glm() (as the link "cloglog"). So, if you want to use this left-skewed distribution and the ones are the rare outcomes in your sample, you have to recode the y-s with the command ifelse(dataset$varname_of_y == 1, 0, 1), turning ones into zeros and zeros into ones
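A minimal sketch of how this could look for the labor force example from above (the model choice is purely illustrative; in glm() the complementary log-log link is requested with link = "cloglog"):
# Sketch: binary response model with the complementary log-log (Gumbel-type) link
outclog <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               data = mroz, family = binomial(link = "cloglog"))
# if the ones were the rare outcomes, one would first flip the coding:
# mroz$inlf_flipped <- ifelse(mroz$inlf == 1, 0, 1)
summary(outclog)$coefficients[, 1:2]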
To show graphically the difference between the LPM and the Probit (or Logit) model, we estimate the probability of denial of a bank credit (deny) as a function of the ratio of monthly debt payments to monthly income (pirat)
library(visreg) # for plotting partial effect
library(wooldridge)
data("HMDA") # data set from the AER package
# convert `deny` to numeric; it is coded as a factor variable in the data set
HMDA$deny <- as.numeric(HMDA$deny) - 1

# estimating linear probability model
denylpm <- lm(deny ~ pirat, data = HMDA)
To estimate the Probit model, we use the glm() function in R (glm stands for generalized linear models). There are two additional options to set in this function: family, for the distribution of y, which is binomial in our case (binary dependent variable), and the link function, which is "probit" for G(\mathbf x \boldsymbol \beta) = \Phi(\mathbf x \boldsymbol \beta)
# estimating probit model
denyprobit <- glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA)
visreg(denylpm, band = FALSE, ylim=c(-0.05, 1.05), main = "LPM", xlab = "P/I ratio")
abline(0,0, lty = 2, col = "darkred"); abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied"); text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)
visreg(denyprobit, scale = "response", band = FALSE, ylim=c(-0.05, 1.05), main = "Probit", xlab = "P/I ratio", rug=0)
abline(0,0, lty = 2, col = "darkred"); abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied"); text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)
The two previous figures show that the predicted probabilities in the Probit model are always within the unit interval, contrary to the LPM
Furthermore, the partial effects of the explanatory variables are not constant any more (as in the LPM), but are nonlinear and depend on the values of \mathbf x
On the one hand, this is a big theoretical advantage, as constant (linear) partial effects make little sense when predicting probabilities
On the other hand, this is a practical disadvantage, as the estimated coefficients \hat\beta are not easily interpretable any more. Basically, you have to rely on average partial effects, APE
The partial effect of the explanatory variable x_j on the probability that y=1 is
\dfrac{\partial P(y=1| \mathbf x)}{\partial x_j} \ = \ \dfrac {\partial G(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \begin{cases} \dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j \ \text{ : Probit } \\ \dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \Lambda(\mathbf x \boldsymbol \beta) (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j \ \text{ : Logit } \end{cases}
\dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \, \dfrac {\partial}{\partial x_j} \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j
\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \dfrac {\partial}{\partial x_j} \left( \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \right) \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{ (1 + e^{\mathbf x \boldsymbol \beta})^2} \, \beta_j
After some manipulations, we eventually get
\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda'(\mathbf x \boldsymbol \beta) \, \beta_j \ = \ \Lambda(\mathbf x \boldsymbol \beta) \, (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j
In both cases the partial effect of x_j on the P(y=1 | \mathbf x) is not equal to \beta_j (which is partial effect on the latent variable y^*), but \beta_j must be multiplied by the corresponding density function, evaluated at \, \mathbf x \boldsymbol \beta, i.e., at particular values of \mathbf x
Hence, the partial effects are nonlinear and depend on the level of \mathbf x
One can calculate the APE (average partial effects), for instance with the R package marginaleffects, or the PEA (partial effects at the average)
The former is the partial effect calculated for each observation \mathbf x_i and then averaged. For the Probit model this is: \widehat{A P E}_{j} =\left( \frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i} \hat{\boldsymbol{\beta}}\right) \right) \hat\beta_{j}
The latter is calculated at the mean of \mathbf x, \bar {\mathbf x}. For the Logit model this is \, \widehat{P E A}_{j} = \Lambda' (\bar{\mathbf{x}} \hat{\boldsymbol{\beta}}) \hat\beta_{j}
PEAs could be somewhat problematic in the case of dummy variables. For instance, if you have a dummy variable for gender in your model, averaging to 0.5 might be questionable
For a discrete explanatory variable x_k, e.g., the number of years of education or gender, the APE can directly be computed (for the Logit model replace \Phi by \Lambda)
\widehat{A P E}_{k} \ = \ n^{-1} \sum_{i=1}^n \left( \Phi\left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k}\left(c_{ik}+1\right) \right] \ - \ \Phi \left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k} c_{ik} \right] \right)
The R package marginaleffects applies this formula automatically to factor variables
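As an illustration, the APE for a continuous regressor can also be computed "by hand" from the Probit formula above; the following sketch (assuming the outprobit model estimated earlier) does this for educ and should come close to the avg_slopes() result:
# Sketch: manual APE of educ in the Probit model, (1/n) sum phi(x_i*b) * b_educ
xb_hat   <- predict(outprobit, type = "link")   # fitted index x*beta-hat
ape_educ <- mean(dnorm(xb_hat)) * coef(outprobit)["educ"]
ape_educ                                        # compare with avg_slopes(outprobit)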
As both models are highly nonlinear, OLS cannot be applied. In principle, NLS (nonlinear least squares) could be applied, but because of the heteroscedasticity of the error terms this would not be efficient
Maximum Likelihood estimation (MLE) is the main procedure
Thereby, we are looking for coefficients \boldsymbol \beta so that the observed sample has the highest probability
Each observation i is treated as a single and independent draw from a Bernoulli distribution (binomial with one draw)
We have a success probability for observation i: \text{ } (remember: G=\Phi for probit and G=\Lambda for logit) P(y_i=1 \, | \, \mathbf x; \boldsymbol \beta) = G(\mathbf x_i \boldsymbol \beta)
and the non-success probability for observation i:
P(y_i=0 \, | \, \mathbf x; \boldsymbol \beta) = 1-G(\mathbf x_i \boldsymbol \beta)
This leads to the joint probability function (for all observations)
P(y_1, \ldots, y_n \, | \, \mathbf x; \boldsymbol \beta) \ = \ \underset {y=0} {\prod_{i=1}^n} \left[ 1-G(\mathbf x_i \boldsymbol \beta) \right] \, \underset {y=1} {\prod_{i=1}^n} G(\mathbf x_i \boldsymbol \beta)
L(\boldsymbol \beta \ | \ \text{data}) \ = \ \prod_{i=1}^n \underbrace {\left[ G(\mathbf x_i \boldsymbol \beta) \right]^{y_i} \, \left[1-G(\mathbf x_i \boldsymbol \beta) \right]^{1-y_i}}_{= f(y_i | \mathbf x, \boldsymbol \beta), \, \text{ conditional density of } y_i}
(Note, if y_i=0 then 1-y_i=1 and if y_i=1 then 1-y_i=0)
Taking logs we get the log-likelihood function
\ln L(\boldsymbol \beta \ | \ \text{data}) \ = \ \sum_{i=1}^n \underbrace {\left\lbrace y_i \ln \left[ G(\mathbf x_i \boldsymbol \beta) \right] \, + \, (1-y_i) \ln \left[1-G(\mathbf x_i \boldsymbol \beta) \right] \right \rbrace}_{\ln f(y_i | \mathbf x, \boldsymbol \beta)}
\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n \left[ y_i \dfrac{g(\mathbf x_i \boldsymbol \beta)}{G(\mathbf x_i \boldsymbol \beta)} \ + \ (1-y_i) \dfrac{-g(\mathbf x_i \boldsymbol \beta)}{1-G(\mathbf x_i \boldsymbol \beta)} \right] \mathbf x_i \ = \ \mathbf 0, \quad \text{ with } g := G'
For the Logit model, the likelihood equations simplify to
\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \, \mathbf x_i \ = \ \mathbf 0
Noting that [y_i - \Lambda(\mathbf x_i \boldsymbol \beta)] are the residuals of the model, the similarities to the OLS normal equations are obvious (compare Equation 1.9)
If \boldsymbol \beta contains an intercept, the first likelihood equation is
\dfrac { \partial \ln L}{\partial \beta_0} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \ = \ 0 \ \ \ \Rightarrow \ \ \ \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n \Lambda(\mathbf x_i \boldsymbol \beta)
Interestingly, the likelihood equations for the Probit case don’t look so familiar, involving \phi/\Phi, the inverse Mills ratio
In both cases, the likelihood equations are highly nonlinear and have to be solved numerically, for instance with the Newton-Raphson algorithm
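To see what the numerical optimization does, here is a minimal sketch (not part of the original code) that maximizes the Logit log-likelihood with optim() for the labor force example and compares the result with glm():
# Sketch: maximizing the Logit log-likelihood numerically
Xmat <- model.matrix(~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
yvec <- mroz$inlf

negloglik_logit <- function(beta) {
  eta <- drop(Xmat %*% beta)
  -sum(yvec * plogis(eta, log.p = TRUE) + (1 - yvec) * plogis(-eta, log.p = TRUE))
}

ml_fit <- optim(rep(0, ncol(Xmat)), negloglik_logit, method = "BFGS")
cbind(optim = ml_fit$par,
      glm   = coef(glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                       data = mroz, family = binomial(link = "logit"))))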
We once more examine a model for the probability of labor force participation of married women, inlf (for “in labor force”)
The explanatory variables are: nwifeinc (additional family income), educ, exper, exper^2, age, kidslt6 (number of children under 6 years) and kidsge6 (number of children over 6 years)
This time, we additionally estimate Logit and Probit models. Furthermore, we use the R package marginaleffects to calculate the implied APEs
The results presented below are pretty similar for the three models
Note that only the APEs are directly comparable across the three models
As shown previously, the coefficients of Logit and Probit models are non-linear transformations of the partial effects
Fortunately, the R package marginaleffects does the work for us and calculates the appropriate APEs
All estimated coefficients have the expected sign
The quantitatively most important variable is kidslt6, the number of children under 6 years. The estimated APE is about -0.26, meaning that, for an average woman, an additional young child reduces the probability of being in the labor force by about 26 percentage points
For the Logit and Probit models, this partial effect depends in a nonlinear manner on the values of the explanatory variables. This is a main advantage compared to the LPM. The following figures show these nonlinear effects of the number of children under 6 on labor force participation
library(wooldridge)
library(marginaleffects)
library(modelsummary)
data("mroz")
# LPM model
outlpm <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

# Logit model
outlogit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz,
                family = binomial(link = "logit"))

# Probit model
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz,
                 family = binomial(link = "probit"))

# Average partial effects with avg_slopes() from the marginaleffects package
marg_lpm    <- avg_slopes(outlpm)
marg_logit  <- avg_slopes(outlogit)
marg_probit <- avg_slopes(outprobit)
summary(outlogit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.425452 0.860365 0.495 0.62095
nwifeinc -0.021345 0.008421 -2.535 0.01126 *
educ 0.221170 0.043439 5.091 3.55e-07 ***
exper 0.205870 0.032057 6.422 1.34e-10 ***
I(exper^2) -0.003154 0.001016 -3.104 0.00191 **
age -0.088024 0.014573 -6.040 1.54e-09 ***
kidslt6 -1.443354 0.203583 -7.090 1.34e-12 ***
kidsge6 0.060112 0.074789 0.804 0.42154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.75 on 752 degrees of freedom
Residual deviance: 803.53 on 745 degrees of freedom
AIC: 819.53
Number of Fisher Scoring iterations: 4
summary(outprobit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2700736 0.5080782 0.532 0.59503
nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
educ 0.1309040 0.0253987 5.154 2.55e-07 ***
exper 0.1233472 0.0187587 6.575 4.85e-11 ***
I(exper^2) -0.0018871 0.0005999 -3.145 0.00166 **
age -0.0528524 0.0084624 -6.246 4.22e-10 ***
kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
kidsge6 0.0360056 0.0440303 0.818 0.41350
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.7 on 752 degrees of freedom
Residual deviance: 802.6 on 745 degrees of freedom
AIC: 818.6
Number of Fisher Scoring iterations: 4
# Modifications for modelsummary to print hit rates for glm()
library(broom)

# additional fit statistics, shown one below the other
glance_custom.glm <- function(x, ...) {
  HRate <- round( hitrate(x, stragaz = FALSE)$HitRate, 2 )
  Sens  <- round( hitrate(x, stragaz = FALSE)$Sensitivity, 2 )
  Spec  <- round( hitrate(x, stragaz = FALSE)$Specificity, 2 )
  Ratio <- round( hitrate(x, stragaz = FALSE)$Nof1s / hitrate(x, stragaz = FALSE)$Nof0s, 2 )
  out   <- data.frame("HitRate" = HRate,
                      "Sensitivity" = Sens,
                      "Specificity" = Spec,
                      "Ratio 1/0" = Ratio)
  return(out)
}
modelsummary( list(" LPM "=outlpm, " LPM APE "=marg_lpm, " Logit "=outlogit,
" Logit APE "=marg_logit, " Probit "=outprobit,
" Probit APE "=marg_probit),
output="flextable",
gof_omit = "F",
align = "ldddddd",
stars = TRUE,
fmt = 4)
 | LPM | LPM APE | Logit | Logit APE | Probit | Probit APE |
---|---|---|---|---|---|---|
(Intercept) | 0.5855*** | 0.4255 | 0.2701 | |||
(0.1542) | (0.8604) | (0.5081) | ||||
nwifeinc | -0.0034* | -0.0034* | -0.0213* | -0.0038* | -0.0120* | -0.0036* |
(0.0014) | (0.0014) | (0.0084) | (0.0015) | (0.0049) | (0.0015) | |
educ | 0.0380*** | 0.0380*** | 0.2212*** | 0.0395*** | 0.1309*** | 0.0394*** |
(0.0074) | (0.0074) | (0.0434) | (0.0073) | (0.0254) | (0.0073) | |
exper | 0.0395*** | 0.0268*** | 0.2059*** | 0.0254*** | 0.1233*** | 0.0256*** |
(0.0057) | (0.0025) | (0.0321) | (0.0022) | (0.0188) | (0.0022) | |
I(exper^2) | -0.0006** | -0.0032** | -0.0019** | |||
(0.0002) | (0.0010) | (0.0006) | ||||
age | -0.0161*** | -0.0161*** | -0.0880*** | -0.0157*** | -0.0529*** | -0.0159*** |
(0.0025) | (0.0025) | (0.0146) | (0.0024) | (0.0085) | (0.0024) | |
kidslt6 | -0.2618*** | -0.2618*** | -1.4434*** | -0.2578*** | -0.8683*** | -0.2612*** |
(0.0335) | (0.0335) | (0.2036) | (0.0319) | (0.1184) | (0.0319) | |
kidsge6 | 0.0130 | 0.0130 | 0.0601 | 0.0107 | 0.0360 | 0.0108 |
(0.0132) | (0.0132) | (0.0748) | (0.0133) | (0.0440) | (0.0132) | |
Num.Obs. | 753 | 753 | 753 | 753 | 753 | 753 |
R2 | 0.264 | 0.264 | ||||
AIC | 865.8 | 865.8 | 819.5 | 819.5 | 818.6 | 818.6 |
BIC | 907.4 | 907.4 | 856.5 | 856.5 | 855.6 | 855.6 |
Log.Lik. | -423.892 | -423.892 | -401.765 | -401.765 | -401.302 | -401.302 |
RMSE | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 |
HitRate | 0.73 | 0.73 | 0.74 | 0.74 | 0.73 | 0.73 |
Sensitivity | 0.82 | 0.82 | 0.81 | 0.81 | 0.81 | 0.81 |
Specificity | 0.62 | 0.62 | 0.64 | 0.64 | 0.63 | 0.63 |
Ratio.1.0 | 1.32 | 1.32 | 1.32 | 1.32 | 1.32 | 1.32 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
visreg(outlpm, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05),
main = "Effect of number of children under 6 on LF-participation - LPM")
visreg(outprobit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05),
main = "Effect of number of children under 6 on LF-participation - Probit")
visreg(outlogit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05),
main = "Effect of number of children under 6 on LF-participation - Logit")
The starting point for obtaining maximum likelihood estimators (MLE) is the assumption that the conditional distribution of the data – the data generating process – is known, except for a finite number of parameters \boldsymbol \theta which have to be estimated
f(y_i \, | \, \mathbf x_i; \boldsymbol \theta), \ \ \forall i
Assuming independent observations, the joint density of the sample is the product of the individual densities
f(y_1, \ldots, y_n \, | \, \mathbf X; \boldsymbol \theta) \ = \ \prod_{i=1}^n f(y_i \, | \, \mathbf x_i; \boldsymbol \theta) \ = \ L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X)
The joint density also defines the Likelihood function L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X), which is a function of the unknown parameters \boldsymbol \theta, given the sample data (\mathbf y, \mathbf X)
Remark: If the observations are not independent, as is very often the case with time series, the joint density is more complicated and involves the covariance matrix of the observations y_i. For more complicated time series models, the joint density can in many cases be obtained with the so-called prediction-error decomposition of the likelihood, which is calculated with the help of the Kalman filter
\ln L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X) \ = \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)
\underset{\boldsymbol \theta}{\max} \ \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)
\dfrac { \partial \ln L}{\partial \boldsymbol \theta} := \ \mathbf g(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \frac {\partial }{\partial \boldsymbol \theta} \, \ln f(y_i | \mathbf x_i, \boldsymbol \theta) \ = \ \mathbf 0
The solution of the likelihood equations leads to the estimated parameter vector \hat{\boldsymbol \theta}_{ML}
This estimate is identified (estimable) if for any other \boldsymbol \theta, L(\hat{\boldsymbol \theta}_{ML}) \neq L(\boldsymbol \theta) (uniqueness)
In most cases, the optimization has to rely on numerical optimization procedures like the Newton-Raphson algorithm
The second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum
If the distributional assumption holds, Maximum Likelihood estimators are generally
Consistent
\operatorname{plim} \hat{\boldsymbol \theta}_{ML} = \boldsymbol \theta_T, \quad \text{with } \boldsymbol \theta_T \text{ being the true parameter vector}
But in finite samples, MLE are not generally unbiased
Asymptotically normal with
\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace^{-1} \right)
\operatorname {Asy.Var} (\hat {\boldsymbol \theta}_{ML}) \, = \, \ \left( - E \left[ \mathbf H(\boldsymbol \theta_T) \right] \right) ^{-1} \, = \, \left( -E \left[ \dfrac {\partial^2 \ln L}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right] \right)^{-1} \, = \,
\left( -E \, \left[ \begin{array}{ccc} \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_k} \\ \vdots & \ddots & \vdots \\ \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_k} \end{array} \right] \right)^{-1}
Asymptotically efficient in the class of consistent and asymptotically normal estimators
The Cramer-Rao Lower Bound is the asymptotic lower bound of the variance of a consistent and asymptotically normal estimator. And this can be shown to be equal to the inverse of the Information Matrix
\mathbf I_n (\boldsymbol \theta_T) := \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = E \left( \mathbf g(\boldsymbol \theta_T) \, \mathbf g(\boldsymbol \theta_T)' \right),
\text{with } \ \mathbf g(\boldsymbol \theta) := \dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y), \ \ E[\mathbf g(\boldsymbol \theta_T)] = \mathbf 0
\mathbf I_n (\boldsymbol \theta_T) = \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = -E[\mathbf H(\boldsymbol \theta_T)]
Finite sample property: If the sample is drawn from an exponential family distribution and if there exists an unbiased minimum variance estimator, then it is a MLE
\mathbf g (\hat {\boldsymbol \theta}) = \mathbf 0, \ \ \Rightarrow \ \ \mathbf g ( \hat {\boldsymbol \theta}) \approx \mathbf g (\boldsymbol \theta_T) + \underbrace{\mathbf H(\boldsymbol \theta_T)}_{\frac {\partial }{\partial \boldsymbol \theta} \mathbf g(\boldsymbol \theta_T)} (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \mathbf 0 \ \ \Rightarrow
\mathbf g (\boldsymbol \theta_T) = -\mathbf H(\boldsymbol \theta_T)(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \ \Rightarrow
(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [-\mathbf H(\boldsymbol \theta_T)]^{-1} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow
(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1} \frac{1}{n} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow
\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \underbrace {[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1}}_{\stackrel{p}{\longrightarrow} \ \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1}} \ \ \ \ \cdot \underbrace{\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T)}_{\stackrel{d}{\longrightarrow} \ N(\mathbf 0,\, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)])} \ \ \Rightarrow
\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \stackrel{d}{\longrightarrow} \ N \left( \mathbf 0, \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)] \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \right) \ =
N (\mathbf 0, \, \lbrace E[-\frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1})
\mathbf g (\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta), \ \ \text{ with } \mathbf g_i(\boldsymbol \theta) := \dfrac {\partial \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta}, \ \ \text{ and } \ \ \mathbf {\bar g} (\boldsymbol \theta) = \frac {1}{n} \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta)
\mathbf H(\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf H_i (\boldsymbol \theta), \ \ \text{ with } \mathbf H_i(\boldsymbol \theta) := \dfrac {\partial^2 \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta \, \partial \boldsymbol \theta'}
\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T) is \sqrt n times the average of random variables \mathbf g_i with an expected value of 0 and covariance matrix \operatorname {Var}\left( \mathbf g_i (\boldsymbol \theta_T) \right) = -E[\mathbf H_i (\boldsymbol \theta_T)] (by the Information matrix equality) and an average covariance matrix of -E[\frac{1}{n} \sum_i \mathbf H_i (\boldsymbol \theta_T)] = -E[\frac {1}{n} \mathbf H (\boldsymbol \theta_T) ]. Hence, we can invoke a CLT
-\frac{1}{n} \mathbf H(\boldsymbol \theta_T) is an average too and, by LLN, converges in probability to its expected value -E[\frac {1}{n} \mathbf H(\boldsymbol \theta_T)]
From the limiting distribution of \sqrt n (\hat {\boldsymbol \theta}_{ML} - \boldsymbol \theta_T) we get the asymptotic distribution of \hat{\boldsymbol \theta}_{ML}
\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace ^{-1} \right)
As the expected value in the covariance matrix formula is generally difficult to obtain, the covariance matrix of the estimated parameters is simply estimated by the inverse of the actual (not the expected) negative Hessian matrix at \hat{\boldsymbol \theta}_{ML}
\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ -\mathbf H(\hat {\boldsymbol \theta})^{-1} \ =
\left( -\dfrac {\partial^2 \ln L(\hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( - \sum_{i=1}^n \dfrac { \partial^2 \ln f(y_i | \mathbf x_i; \hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( -\sum_{i=1}^n \mathbf H_i(\hat {\boldsymbol \theta}) \right)^{-1}
with f(y_i | \mathbf x; \boldsymbol \theta) being the conditional distribution function of observation i
Alternatively, by the information matrix equality, the covariance matrix can be estimated with the outer product of the score vectors (the BHHH estimator)
\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ \left( \sum_{i=1}^n \mathbf g_i(\hat{\boldsymbol \theta}) \, \mathbf g_i(\hat{\boldsymbol \theta})' \right)^{-1}
Why does the variance of the estimated parameters depend on the second derivatives?
The second derivative measures how concave the function is. If this is (in absolute terms) small at the optimum (nearly linear, with a slope of 0), the likelihood function is flat at the optimum. This means that there is a large interval of other parameters having nearly the same likelihood. Hence, the variance of the estimated \theta is high
On the other hand, if the second derivative is large in absolute terms, the likelihood function is very curved at the optimum, meaning that there is only a small interval of parameters having nearly the same likelihood. Therefore, the estimate is precise and the variance of \hat \theta is low
Hence, we have an inverse relationship between the size of the second derivatives of the likelihood function and the variance of the estimated parameters; and this inverse relationship is reflected in the covariance matrix formula
As we already know, the second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum
For the Logit model we have: \mathbf H = - \sum_i \Lambda(\mathbf x_i \boldsymbol \beta) (1-\Lambda(\mathbf x_i \boldsymbol \beta) )\mathbf x_i \mathbf x_i', which is clearly negative definite for all \boldsymbol \beta
A similar result applies to the Probit model, which means that the likelihood functions are always concave for the two models, guaranteeing a global maximum
Because we have an estimate of the covariance matrix of \hat {\boldsymbol \theta}, we can carry out all of the usual Wald tests about linear hypotheses (after linearization, nonlinear restrictions are also possible – the restriction matrix \mathbf R is replaced by the Jacobian matrix of the nonlinear restrictions)
With this test procedure, only the unrestricted model has to be estimated. If the restriction is not compatible with the data, i.e., the restricted estimate \mathbf R \hat{\boldsymbol \theta} is too far away from the hypothesized value, the restriction is rejected
The big advantage of this test strategy is that with a large enough sample no assumption about the distribution is required (we do not need the likelihood function)
In the context of MLE, we have an estimate of the log-likelihood of the observed sample. This can be exploited for a Likelihood Ratio (LR) test
LR \ = \ 2 \left( \ln L(\hat{\boldsymbol \theta}_{ur}) - \ln L(\hat{\boldsymbol \theta}_r) \right) \ \stackrel{a}{\sim} \ \chi^2(m)
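A minimal sketch of an LR test, computed by hand for the labor force Probit model (the particular restriction – dropping nwifeinc and kidsge6, so m = 2 – is chosen here only for illustration):
# Sketch: likelihood ratio test with m = 2 restrictions
unrestr <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               data = mroz, family = binomial(link = "probit"))
restr   <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6,
               data = mroz, family = binomial(link = "probit"))
LR   <- as.numeric(2 * (logLik(unrestr) - logLik(restr)))
pval <- pchisq(LR, df = 2, lower.tail = FALSE)
c(LR = LR, p.value = pval)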
The third test concept, which is also derived from the Maximum Likelihood principle, is the Lagrange Multiplier test (LM)
\underset{\boldsymbol \theta}{\max}: \ \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) - \boldsymbol \lambda' \mathbf c(\boldsymbol \theta)
\underbrace {\dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X)}_{\mathbf g(\boldsymbol \theta)} - \boldsymbol \lambda' \dfrac {\partial }{\partial \boldsymbol \theta} \mathbf c(\boldsymbol \theta) = \mathbf 0
\dfrac {\partial }{\partial \theta_j} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) = \lambda_{\tilde j}
Thus, the vector \boldsymbol \lambda, which is often interpreted as shadow prices for the restrictions \theta_j = c_j, is equal to the corresponding derivatives of the log-likelihood function. Therefore, \lambda_{\tilde j} measures the extent to which the first order conditions with respect to the unconstrained model are violated (without the restriction, the corresponding derivative should differ from \mathbf 0 only because of sampling errors)
Hence, one idea is to test whether \boldsymbol \lambda is zero. A similar and more common test is directly based on the gradient vectors \mathbf g(\boldsymbol \theta_r), the score vectors, instead of \boldsymbol \lambda; hence, the label score test. This as well measures to what extent the gradient differs from \mathbf 0, the expected value of \mathbf g(\boldsymbol \theta) in the unconstrained optimum
It turns out that \mathbf g(\boldsymbol \theta) is asymptotically normal distributed and we have a “Wald-like” LM test statistic
LM \ = \ \mathbf g(\boldsymbol \theta_r)' \, \mathbf I(\boldsymbol \theta_r)^{-1} \mathbf g(\boldsymbol \theta_r) \ \ \stackrel{a}{\sim} \ \ \chi^2(m)
with \mathbf I(\boldsymbol \theta_r) being the Information matrix, evaluated at the restricted parameter vector
An asymptotic equivalent procedure to the score test is to first run a restricted regression and then, in a second step, an auxiliary regression using the residuals of the first step to test for relaxation of the restrictions
Typically, the test statistic for this variant of the LM test is \, n \cdot R^2, which is \chi^2(m) distributed
This procedure is especially suited for testing misspecifications, as in this case the restricted variant is typically easier to estimate than the unrestricted model (a small n \cdot R^2 example is sketched after the following list). Examples are tests for
Heteroskedasticity (Breusch-Pagan test)
Serial correlation (Breusch-Godfrey test)
General misspecifications (RESET)
Endogenous regressors (Hausman-Wu test)
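As a sketch of the n·R² variant (using the LPM outm estimated at the beginning of this section; the Breusch-Pagan style regression here is only an illustration – the LPM is heteroskedastic by construction, so a strong rejection is expected):
# Sketch: Breusch-Pagan type LM test, LM = n * R^2 of the auxiliary regression
usq <- residuals(outm)^2
aux <- lm(usq ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
LM   <- nobs(aux) * summary(aux)$r.squared
pval <- pchisq(LM, df = 7, lower.tail = FALSE)   # m = 7 (all slopes of the auxiliary regression)
c(LM = LM, p.value = pval)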
Tests for exclusions of variables or linear restrictions among variables are typically easier to carry out with a Wald test
All three test procedures introduced – Wald tests, LR tests and LM tests – are only suitable for nested models and are asymptotically equivalent
library(magick)
img <- image_read("topic3_tests.jpg")
image_scale(img, "600")
We already dealt with the hit rate in the context of binary choice models, with the refinements of sensitivity and specificity
Pseudo R-squared: Compare the maximized log-likelihood of the model with that of a model that only contains a constant (and no explanatory variables); McFadden’s \tilde{R}^2
\tilde{R}^{2}=1- \dfrac {\ln L_{u r}} {\ln L_{0}}
Correlation based measures: Look at correlation (or squared correlation) between predicted values or predicted probabilities versus observed values
\operatorname {Corr} (y_i, \hat y_i)
Information criteria:
Akaike information criterion, AIC = -2 \ln L + 2k
Bayesian (Schwarz) information criterion BIC = -2 \ln L + k \ln n
Both are a measure of the information loss due to an imperfect model. They deal with the trade-off between the goodness of fit and the simplicity of the model, thereby considering both the risk of overfitting and underfitting
They are based on the log-likelihood of the model but place a penalty on the number of estimated parameters, similar to the adjusted R-squared
These criteria can also be used to compare non-nested models (as long as y is the same): the lower, the better – this strategy is said to lead to better out-of-sample predictions
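A short sketch computing these measures for the Logit model estimated above (the null model with only a constant is re-estimated here for the McFadden pseudo R²):
# Sketch: goodness-of-fit measures for the Logit model
outnull  <- glm(inlf ~ 1, data = mroz, family = binomial(link = "logit"))
mcfadden <- 1 - as.numeric(logLik(outlogit)) / as.numeric(logLik(outnull))
corr_fit <- cor(mroz$inlf, fitted(outlogit))   # correlation-based measure
c(McFadden = mcfadden, Corr = corr_fit, AIC = AIC(outlogit), BIC = BIC(outlogit))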
For an introduction to Bayesian statistics watch this video:
https://www.youtube.com/watch?v=Pahyv9i_X2k
As the video above shows, the basic idea of Bayesian statistics is to update prior information about a parameter vector \boldsymbol \theta with observed data. Bayes Theorem implies that the posterior distribution of the parameter vector \boldsymbol \theta, f(\boldsymbol \theta|data), is proportional to the likelihood function f(data|\boldsymbol \theta) times the prior distribution f(\boldsymbol \theta). An important economic application is estimation of DSGE models.
However, the posterior distribution is often very complicated and difficult or impossible to integrate. For instance, we need to integrate to get the expectation of the posterior E(\boldsymbol \theta|data) = \int_\theta \boldsymbol \theta f(\boldsymbol \theta|data) d\boldsymbol \theta.
A solution to this problem are sampling algorithms, which draw samples of \boldsymbol \theta from an arbitrary proposal distribution in such a way that we finally get draws that are asymptotically distributed like the posterior distribution. If we have such a large sample of \boldsymbol \theta from the posterior, it is very easy to get the expectation of \boldsymbol \theta (it is simply the mean) or credible intervals (the Bayesian counterparts to confidence intervals).
Markov Chain Monte Carlo (MCMC) methods, and in particular the Metropolis-Hastings (MH) algorithm, are such powerful sampling procedures. By the way, these algorithms are also very important in many other fields like climate modeling, physics or big data econometrics. To learn more, watch the following video:
https://www.youtube.com/watch?v=U561HGMWjcw
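To illustrate the idea, here is a deliberately simple random-walk Metropolis-Hastings sketch (a made-up toy example, not from the original notes): it samples from the posterior of a normal mean \mu with known standard deviation 1 and a N(0, 10^2) prior.
# Sketch: random-walk Metropolis-Hastings for the mean of a normal distribution
set.seed(42)
y_obs <- rnorm(50, mean = 1.5, sd = 1)            # simulated data

log_post <- function(mu) {
  sum(dnorm(y_obs, mean = mu, sd = 1, log = TRUE)) + dnorm(mu, 0, 10, log = TRUE)
}

n_draws <- 5000
draws   <- numeric(n_draws)
mu_cur  <- 0
for (s in 1:n_draws) {
  mu_prop <- mu_cur + rnorm(1, sd = 0.5)          # random-walk proposal
  if (log(runif(1)) < log_post(mu_prop) - log_post(mu_cur)) mu_cur <- mu_prop
  draws[s] <- mu_cur
}
mean(draws[-(1:1000)])                            # posterior mean after burn-in
quantile(draws[-(1:1000)], c(0.025, 0.975))       # 95% credible interval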
Count data arise when the dependent variable is a counting number per unit of time or space – e.g., the number of restaurant meals taken by a consumer during a week or the number of arrests of a person during a year. So, y_i \in \lbrace 0, 1, 2, \ldots \rbrace
If these numbers are large and widespread, the variable can be considered as continuous and hence, OLS would be appropriate
If these numbers are small, quite rare events, then the continuous approximation is a poor one
In this case, OLS can still yield consistent estimates and the predicted values are interpreted as conditional expected values of the dependent variable
However, OLS does not account for the special structure of the data, leading to inefficiency and nonsense predictions, like -0.2 restaurant meals per week
One remedy to avoid negative predictions is to postulate
E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
and apply NLS. However, this would not be efficient as the error term is heteroscedastic
Therefore, maximum likelihood estimation is usually employed for this type of models. As count data are often Poisson distributed this leads to the Poisson regression model (or to the more general negative binomial regression model)
The Poisson distribution is derived from a limit of the binomial distribution, which we already encountered with the Logit and Probit models. If the number of independent (Bernoulli) draws goes to infinity and the success probability p goes to zero, so that n\cdot p \rightarrow \lambda, the binomial distribution converges to
P(Y = y_i) = \dfrac {\lambda^{y_i}}{y_i!} e^{-\lambda}
And this is called Poisson distribution, with y_i the number of counts
The parameter \lambda is equal to both the expectation and the variance of counts! This distribution is considerably right-skewed for a small \lambda, but for \lambda>30, it’s almost identical to a N(\lambda, \lambda) normal distribution
As we are interested in the effects of the explanatory variables on y (the number of counts), we must look at the Poisson distribution conditional on \mathbf x. We only have to specify \lambda = E[y \, | \, \mathbf x], which we set to
\lambda = E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
as above – \lambda has to be positive!
Given a sample of n observations, we can construct the log-likelihood function, which is the joint log-probability that each y_i is observed
\ln L(\boldsymbol \beta) \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \ln \lambda_i - \lambda_i - \ln(y_i!) \right] \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \, {\mathbf x_i \boldsymbol \beta} - e^{\mathbf x_i \boldsymbol \beta} - \ln(y_i!) \right], \quad \text{with } \lambda_i = e^{\mathbf x_i \boldsymbol \beta}
Note, that the last term, \ln(y_i!), is independent of \boldsymbol \beta and could be omitted
The estimates \hat{\boldsymbol \beta} are obtained by numerical maximization of the log-likelihood function
As we have
E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
the partial effects are
\dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} = {\underbrace {e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}_{E[y \, | \, \mathbf x]}} \, \beta_j \ \ \Rightarrow \tag{10.2}
\dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} \dfrac {1}{E[y \, | \, \mathbf x]} \ = \ \dfrac {\partial \ln E[y \, | \, \mathbf x]}{\partial x_j}{} \ = \ \beta_j \tag{10.3}
This is a semi-elasticity: a change in x_j of 0.01 leads approximately to a \beta_j percent change in E[y \, | \, \mathbf x]. For larger variations in x_j (e.g., for a dummy variable) use the exact formula (e^{\beta_j \Delta x_j}-1)\cdot 100 for the percentage change
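A quick numerical illustration with a hypothetical coefficient (beta_j = 0.5 is made up for this example):
# Sketch: approximate vs. exact percentage effect in the Poisson model
beta_j <- 0.5
100 * beta_j * 0.01            # approximate % change for a change of 0.01 in x_j
(exp(beta_j * 1) - 1) * 100    # exact % change for a dummy switching from 0 to 1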
The Poisson regression model assumes that \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) which is very restrictive and often violated in applied work
If \operatorname{Var}(y|\mathbf x) > E(y|\mathbf x), we have overdispersion (which is quite common), in the reverse case underdispersion (which is not so common)
Fortunately, MLE of the Poisson regression model are consistent and asymptotically normal even if the Poisson distribution does not hold. This is then called Quasi-Maximum Likelihood estimation (QML). (This is similar to the case of OLS when the disturbance is not normally distributed)
If the distributional assumption does not hold and QML is used, standard errors are wrong. One then has to compute adjusted standard errors
The residuals of the Poisson model are as usual \hat u_i = y_i-\hat y_i. To judge whether there is overdispersion the residuals are scaled (divided) by \sqrt{\hat y_i}. This is a proper heteroscedasticity correction. Additionally, \hat y is a predictor for E(y) and therefore also for \operatorname{Var}(y|\mathbf x). Hence, the variance of these scaled residuals should be one if the condition \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) is true
If the calculated standard error \hat \sigma of these transformed residuals is larger than one, we have overdispersion. In this case the estimated standard errors of the coefficients should be multiplied by \hat \sigma
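A sketch of this check as a small helper function (the function name and the degrees-of-freedom convention n - k are my own choices, not from the original notes); it can be applied to any glm() fit with family = poisson:
# Sketch: scaled residuals and the overdispersion factor sigma-hat
overdispersion_sigma <- function(pois_fit) {
  yhat     <- fitted(pois_fit)
  y_obs    <- model.response(model.frame(pois_fit))
  u_scaled <- (y_obs - yhat) / sqrt(yhat)              # heteroscedasticity-corrected residuals
  sqrt(sum(u_scaled^2) / df.residual(pois_fit))        # about 1 if Var(y|x) = E(y|x) holds
}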
As an example we analyze the crime1 data set of the wooldridge package. The dependent variable narr86 is the number of times a man was arrested during 1986. This variable is zero for 1970 of the 2725 men in the sample, and only eight values of narr86 are larger than five. So, the Poisson regression model is more appropriate than OLS
Some interesting explanatory variables are
pcnv: proportion of prior convictions
tottime: time in prison since age 18, in months
ptime86: months in prison during 1986
inc86: legal income in 1986, in $100s
black: =1 if black
hispan: =1 if Hispanic
We estimate the model with OLS as well as with a Poisson regression model and a Quasi-Poisson regression model; the latter two only differ by the correction of the standard errors because of overdispersion
Additionally, we calculate the APE, which are comparable to the OLS estimates
The OLS estimate for pcnv is -0.132, which means that with \Delta pcnv = 0.1, the expected number of arrests during 1986 decreases by 0.013
The corresponding coefficient of the Poisson model is -0.402. But this is now a semi-elasticity: with \Delta pcnv = 0.1, the expected number of arrests during 1986 decreases by about 4.02%. Calculating the APE (which averages Equation 10.2) with the package marginaleffects, we get the absolute effect of -0.162, which is now comparable to the OLS estimate
The effect of black in the OLS model is 0.327, meaning that for black men the expected number of arrests is 0.327 higher than for white men
The Poisson coefficient for black is 0.661. In this case (a large \Delta x_j), we should use the exact formula, e^{\beta_j \Delta x_j}-1, leading to exp(0.661)-1 \approx 0.94, i.e., the expected number of arrests is about 94% higher than for white men. Calculating the APE (which averages Equation 10.2), we get the effect in absolute terms, 0.328, which is about the same as the estimated OLS effect
The SE of the transformed residuals of the Poisson model is 1.232, showing a moderate overdispersion. The column QPoisson shows the corrected standard errors for the estimated \beta_j
The estimated R^2, which is the squared correlation of actual versus predicted values of narr86, shows that the overall fit of the Poisson regression model is better than that of the linear regression model, although the OLS coefficients are chosen to maximize R^2. But OLS is restricted to a linear relationship
library(wooldridge)
data("crime1")
# OLS
outlm <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
             inc86 + black + hispan + born60,
             data = crime1)

# Poisson - family = poisson and the default link is "log",
# so the inverse default link is "exp"
outpoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
                  inc86 + black + hispan + born60,
                  data = crime1, family = poisson)

# Quasi-Poisson - this only adjusts the standard errors
outquasipoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
                       inc86 + black + hispan + born60,
                       data = crime1, family = quasipoisson)

# Calculating APE with "marginaleffects"
marg_lm    <- avg_slopes(outlm)
marg_pois  <- avg_slopes(outpoisson)
marg_qpois <- avg_slopes(outquasipoisson)

# Reset for modelsummary for glm()
glance_custom.glm <- function(x, ...) {
}
modelsummary( list("OLS"=outlm, "Poisson"=outpoisson, "Poisson APE"=marg_pois,
"QPoisson"=outquasipoisson, "QPoisson APE"=marg_qpois),
output="flextable",
gof_omit = "F",
align = "lddddd",
stars = TRUE,
fmt = 3)
 | OLS | Poisson | Poisson APE | QPoisson | QPoisson APE |
---|---|---|---|---|---|
(Intercept) | 0.577*** | -0.600*** | -0.600*** | ||
(0.038) | (0.067) | (0.083) | |||
pcnv | -0.132** | -0.402*** | -0.162*** | -0.402*** | -0.162*** |
(0.040) | (0.085) | (0.035) | (0.105) | (0.043) | |
avgsen | -0.011 | -0.024 | -0.010 | -0.024 | -0.010 |
(0.012) | (0.020) | (0.008) | (0.025) | (0.010) | |
tottime | 0.012 | 0.024+ | 0.010+ | 0.024 | 0.010 |
(0.009) | (0.015) | (0.006) | (0.018) | (0.007) | |
ptime86 | -0.041*** | -0.099*** | -0.040*** | -0.099*** | -0.040*** |
(0.009) | (0.021) | (0.008) | (0.025) | (0.010) | |
qemp86 | -0.051*** | -0.038 | -0.015 | -0.038 | -0.015 |
(0.014) | (0.029) | (0.012) | (0.036) | (0.014) | |
inc86 | -0.001*** | -0.008*** | -0.003*** | -0.008*** | -0.003*** |
(0.000) | (0.001) | (0.000) | (0.001) | (0.001) | |
black | 0.327*** | 0.661*** | 0.328*** | 0.661*** | 0.328*** |
(0.045) | (0.074) | (0.045) | (0.091) | (0.055) | |
hispan | 0.194*** | 0.500*** | 0.235*** | 0.500*** | 0.235*** |
(0.040) | (0.074) | (0.040) | (0.091) | (0.050) | |
born60 | -0.022 | -0.051 | -0.020 | -0.051 | -0.020 |
(0.033) | (0.064) | (0.025) | (0.079) | (0.031) | |
Num.Obs. | 2725 | 2725 | 2725 | 2725 | 2725 |
R2 | 0.072 | ||||
AIC | 6721.4 | 4517.5 | 4517.5 | ||
BIC | 6786.4 | 4576.6 | 4576.6 | ||
Log.Lik. | -3349.678 | -2248.761 | -2248.761 | ||
RMSE | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
In many economic contexts, decision problems are such that either a positive amount or a zero amount is chosen, e.g., the demand for cars or alcohol (corner solutions). In this case, we have a form of censored data and the distribution of car expenditures is a combination of a discrete distribution (at 0) and a continuous distribution
Another form of censored data is when the dependent variable is censored from above or below a certain threshold, typically due to survey design; e.g., if respondents of a survey are allowed to answer a question about their wealth with “more than 1 million”. Typically, no optimization problem is modeled in this case
In the first example for censored data we have expenditure data for buyers and non-buyers (which are zero). However, if the data are collected from sales tax records, then the data would only include buyers
Data, in which observations are unavailable above or below a certain threshold are called truncated data. Note, for truncated data we neither observe y_i nor \mathbf x_i, in contrast to censored data. The sample is not random any more if the selection mechanism is related to the value of the dependent variable
Remark: Selection mechanisms which are exogenous, i.e., the dependent variable and its error term are not involved, pose no special estimation problems
Example:
Suppose we want to estimate the relationship between the saving rate and income, but we have oversampled people with high income (exogenous sampling mechanism, which is based on an exogenous (rhs) variable). If the same relationship between savings and income is true for all people, this would not lead to biased estimates, (although with higher variance – less variation in the explanatory variable)
However, if we have oversampled people with a high saving rate (endogenous sampling mechanism) this would lead to biased estimates even if the same relationship between savings and income is true for all people (sample selection problem)
A linear regression model may be inadequate for both censored and truncated data
Figure 10.4 below shows the effects of censored and truncated data for estimating the true (population) relationship. In both cases we see a significant bias towards zero in the estimates based on the observed data. Note, regarding the truncated sample, we have an endogenous sampling mechanism related to the value of y – thus there is a sample selection problem
library("ggplot2")
library("dplyr")
library("tidyverse")
baw <- FALSE
set.seed(2020)
N <- 1000
X <- rnorm(N, 3, 4)
Y <- -1 + 0.5 * X + rnorm(N, 0, 1.8)
z <- data.frame(X = X, Y = Y)

z <- mutate(z,
            neg = factor(Y < 0, labels = c("yes", "no")),
            Yc = ifelse(Y > 0, Y, 0),
            Yt = ifelse(Y > 0, Y, NA))
z2 <- gather(z, key, value, -X, -neg)

colols <- ifelse(baw, "black", "#AA4444")
ys <- c("Y" = "Whole sample", "Yc" = "Censored sample", "Yt" = "Truncated sample")

gp <- ggplot(z2, aes(X, value)) + geom_point(size = 0.6, aes(colour = neg)) +
  geom_smooth(method = "lm", se = FALSE, lty = "twodash", color = colols) +
  geom_abline(intercept = -1, slope = 0.5) +
  facet_wrap(~ key, ncol = 3, labeller = labeller(key = ys)) +
  guides(colour = FALSE) + xlab("") + ylab("") +
  theme(legend.text = element_text(size = 6),
        legend.title = element_text(size = 8),
        axis.title = element_text(size = 8))
if (baw) gp <- gp + scale_color_grey()
gp
library(wooldridge);
data("crime1")
# Number of arrest - typical count data
hist(crime1$narr86, main = "Typical count data, should look like a Poisson distribution")
library(wooldridge);
data("mroz")
# Number of hours worked - not everybody work
# Data apparently not Poisson distributed
hist(mroz$hours, main = "Typical censored data, (>= 0)")
library(wooldridge);
data("recid")
# Duration time until next arrest. We have different release times from prison,
# hence different thresholds as the endpoint for observations is fixed
hist(recid$durat, main = "Censored data from above with different but known thresholds")
Let’s start with the analysis of truncated regression models. Characteristic for these is that the outcome y_i and the explanatory variables \mathbf x_i are only observed if the outcome y_i is larger (or smaller) than some threshold c_i
In this case, the sample is not a random sample from the population (because some units will never be a part of the sample)
The truncated normal regression model is:
y_i = \mathbf x_i \boldsymbol \beta + u_i
u_i \, | \, \mathbf x_i \sim N(0, \sigma^2)
(y_i, \mathbf x_i) \ \text{ only observed if } \ y_i > c_i \ \ \ (\text{or } \ y_i < c_i) \tag{10.4}
Truncation from below or left: observe y_i and \mathbf x_i if y_i > c_i
Truncation from above or right: observe y_i and \mathbf x_i if y_i < c_i
In the following we examine truncation from below. Truncation from above is handled analogously and does not require any new results, Green (2017)
Theorem 10.1 (Truncated normal distribution, see Green (2017)) Let y be a normally distributed random variable with density f(y), mean E(y)=\mu and variance \operatorname {Var}(y)=\sigma^2 and a cutoff constant c. Then the conditional density function f(y|y>c) is
f(y) \ \equiv \ f(y|y>c)\cdot P(y>c) \ \ \Rightarrow
f(y|y>c) = \frac{f(y)}{P(y>c)} = \frac{f(y)}{P(y-\mu>c-\mu)} =
\frac{f(y)}{P[(y-\mu)/\sigma > (c-\mu)/\sigma]} \, = \, \frac{f(y)}{1-F[(c-\mu)/\sigma]} \, =
\frac{\frac{1}{\sigma}\phi \left( \dfrac{y-\mu}{\sigma}\right)}{1-\Phi\left(\dfrac{c-\mu}{\sigma}\right)} \ \tag{10.5}
As already defined above, \phi(z) denotes the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution (see Equation A.3 and Equation A.4).
The mean and variance are given by:
\begin{align*}E(y|y>c) &= \mu + \sigma \, \lambda \left(\frac{\mu-c}{\sigma} \right) \\ \operatorname{Var}(y|y>c) &= \sigma^2 \left[ 1-\delta \left(\frac{\mu-c}{\sigma}\right) \right] \end{align*} \tag{10.6}
where \lambda(x) is the Inverse Mills Ratio (or hazard function of the standard normal distribution):
\lambda(x) := \dfrac {\phi(x)}{\Phi(x)}, \ \ \lambda(x) >0, \ \lambda'(x) <0, \ \forall x \tag{10.7}
\delta(x) := \lambda(x)[\lambda(x)-x], \ \ 0<\delta(x)<1
Remarks: E(y|y>c) > E(y) and \operatorname {Var}(y|y>c) < \operatorname {Var}(y).
The Inverse Mills Ratio \lambda(x) (IMR) is fundamental in models with incomplete data or self-selective samples. It can be interpreted as a measure of the probability that a particular observation is selected (is in the sample) compared to those that are not selected (not in the sample). A lower IMR indicates a higher probability of selection.
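The moments in Equation 10.6 can be checked with a small simulation sketch (the parameter values \mu = 2, \sigma = 3, c = 1 are arbitrary):
# Sketch: verifying E(y | y > c) = mu + sigma * lambda((mu - c)/sigma) by simulation
imr <- function(x) dnorm(x) / pnorm(x)     # inverse Mills ratio, Equation 10.7

mu <- 2; sig <- 3; c0 <- 1
set.seed(1)
ysim <- rnorm(1e6, mean = mu, sd = sig)
mean(ysim[ysim > c0])                      # simulated truncated mean
mu + sig * imr((mu - c0) / sig)            # formula from Equation 10.6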
x <- seq(-3, 3, length = 100)

# plotting normal distribution
plot(x, dnorm(x), type = "l", ylim = c(0, 1.125))

# plotting truncated normal distribution, truncated from below at -0.5
a  <- -0.5
xt <- seq(a, 3, length = 100)
lines(xt, dnorm(xt)/(1 - pnorm(a)))
o <- dnorm(a)/(1 - pnorm(a))
segments(a, 0, x1 = a, y1 = o, lty = 3)

# plotting truncated normal distribution, truncated from below at 0.5
a  <- 0.5
xt <- seq(a, 3, length = 100)
lines(xt, dnorm(xt)/(1 - pnorm(a)))
o <- dnorm(a)/(1 - pnorm(a))
segments(a, 0, x1 = a, y1 = o, lty = 3)
Applying OLS would not yield correct results because MLR.2 is violated (see Section 2.7). To see this note that, according to the theorem above, the expectation of y_i for the observed subpopulation y_i > c_i is:
E(y_i|y>c_i) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \tag{10.8}
Hence, regressing y_i only on \mathbf x_i leads to an omitted variable bias as the omitted variable \lambda(\mathbf x_i) (the inverse Mills Ratio, Equation 10.7) is a nonlinear function of \mathbf x_i and therefore probably correlated with \mathbf x_i
The partial effect of x_j on E(y_i|y>c_i), which we want to estimate, is:
\dfrac {\partial E(y_i|y>c_i)}{\partial x_j} \, = \, \beta_j + \sigma \, \lambda' \, \dfrac {\partial \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) }{\partial x_j} \, = \, \beta_j(1 + \underset {(-)} {\lambda'}) \tag{10.9}
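Together with the footnote result \lambda'(\alpha) = -\alpha \lambda(\alpha) - \lambda(\alpha)^2 and -1<\lambda'<0, Equation 10.9 says that the slope of E(y_i|y>c_i) with respect to x_j is the population \beta_j shrunk by the factor (1+\lambda'). A quick sketch (our own illustration, not part of the original code) evaluates this shrinkage factor over a grid of values of \alpha = (\mathbf x_i \boldsymbol \beta - c_i)/\sigma:
# shrinkage factor (1 + lambda') from Equation 10.9 (illustrative sketch)
lambda  <- function(x) dnorm(x) / pnorm(x)            # inverse Mills Ratio, Eq. 10.7
dlambda <- function(x) -x * lambda(x) - lambda(x)^2   # lambda'(x), see the footnote

alpha <- seq(-3, 3, by = 1)                           # alpha = (x*beta - c)/sigma
round(cbind(alpha, shrinkage = 1 + dlambda(alpha)), 3)
# the factor lies strictly between 0 and 1: the slope in the truncated sample
# understates beta_j, most severely where the truncation bites (small alpha)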
As just shown, OLS is biased and inconsistent for truncated data because of an omitted variable problem.
Thus, Maximum Likelihood estimation seems appropriate.
f\left(y_{i} \, | \, \mathbf{x}_{i}, y_i > c_{i}\right) \ = \ \frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \tag{10.10}
Maximizing the log-Likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2
\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{i=1}^n \left\{ \ln \left[ \frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right) \right] - \ln \left[ 1 - {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \right\}
\rightarrow \ \ \ \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2
Note that non-normality or heteroscedasticity in u_i leads to inconsistent estimates
Further note that the untruncated data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)
# We simulate data and store them in the data.frame d
set.seed(1234890)
n <- 10000

# true parameters, which we want to estimate
sigma  <- 4
beta_0 <- 2
beta_1 <- 1

x   <- rnorm(n, mean = 0, sd = 2)
eps <- rnorm(n, sd = sigma)

# the true model
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg)  # Truncated Regression package

# generating truncated data y = yt, truncated at c = 1 (observations with y <= 1 are dropped)
d$yt <- ifelse(d$y > 1, d$y, NA)

# compare estimates for full/truncated data
# we use the function `truncreg` from the package `truncreg`
full  <- lm(y ~ x, data = d)
ols   <- lm(yt ~ x, data = d)
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")
summary(trunc)
Call:
truncreg(formula = yt ~ x, data = d, point = 1, direction = "left")
BFGS maximization method
45 iterations, 0h:0m:0s
g'(-H)^-1g = 7.38E-09
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 2.119760 0.170904 12.403 < 2.2e-16 ***
x 0.994784 0.046537 21.376 < 2.2e-16 ***
sigma 3.972827 0.079421 50.022 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -13547 on 3 Df
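To make the connection to the log-likelihood above explicit, the following sketch (our own code, not part of the original example) maximizes the truncated log-likelihood of Equation 10.10 directly with optim(), using the simulated data d from the chunk above; the results should essentially reproduce the truncreg() estimates:
# direct MLE of the truncated log-likelihood, Eq. 10.10 (illustrative sketch)
dt <- d[!is.na(d$yt), ]                          # only the observed (truncated) data
negll <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s <- exp(par[3])   # sigma > 0 via exp()
  mu <- b0 + b1 * dt$x
  -sum( dnorm(dt$yt, mean = mu, sd = s, log = TRUE) -
        pnorm(1, mean = mu, sd = s, lower.tail = FALSE, log.p = TRUE) )
}
res <- optim(c(mean(dt$yt), 0, log(sd(dt$yt))), negll, method = "BFGS")
c(beta_0 = res$par[1], beta_1 = res$par[2], sigma = exp(res$par[3]))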
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc),
output="flextable",
gof_map = c("nobs", "aic", "bic", "rmse"),
align = "lddd",
stars = TRUE,
fmt = 3)
|             | OLS full data | OLS trunc data | MLE trunc data |
|-------------|---------------|----------------|----------------|
| (Intercept) | 2.081***      | 4.707***       | 2.120***       |
|             | (0.040)       | (0.037)        | (0.171)        |
| x           | 1.035***      | 0.474***       | 0.995***       |
|             | (0.020)       | (0.019)        | (0.047)        |
| sigma       |               |                | 3.973***       |
|             |               |                | (0.079)        |
| Num.Obs.    | 10000         | 5967           | 5967           |
| AIC         | 56035.8       | 29009.9        | 27099.7        |
| BIC         | 56057.4       | 29030.0        | 27119.8        |
| RMSE        | 3.99          | 2.75           | 3.69           |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Generally, censored data do not result from a truncation problem. In contrast to truncated data, the explanatory variables \mathbf x_i are observed even if the outcome y_i lies beyond (above or below) some threshold c_i
Typically, a dependent variable is censored from above or below a certain threshold due to survey design; e.g., values are only reported up to a certain level (e.g., top coded wealth)
With y_i^* being the true outcome and y_i being the observed outcome, the censored normal regression model is
y_i^* = \mathbf x_i \boldsymbol \beta + u_i
u_i \, | \, \mathbf x_i \sim N(0, \, \sigma^2)
y_i = \begin{cases} \max(y_i^*,c_i) \ \ \ \text{:left censoring (from below)} \\ \min(y_i^*,c_i) \ \ \ \ \text{:right censoring (from above)} \\ \max(y_i^*,\underline {c}_i) \ \land \ \min(y_i^*,\bar c_i) \ \ \ \text{:censoring from above and below} \end{cases} \tag{10.11}
For censored data, the distribution function of y_i is a mixture of a discrete distribution, concentrated at the threshold c_i, and a truncated continuous density; the normal density in our case
Such a distribution is shown in Figure 10.6 above with threshold c_i=0
We have two parts of the distribution of y_i (all conditional on \mathbf x_i, which we leave out to save notation):
P(y_i = c_i) \,=\, P(y_i^* \leq c_i) \,=\, P(u_i \leq c_i - \mathbf x_i \boldsymbol \beta) \,=\, P \left( \frac {u_i}{\sigma} \leq \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \,=
\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.12}
For the whole two-part distribution of y_i we have
f(y_i \,|\, \mathbf x_i, c_i) \ = \begin{cases} P(y_i=c_i) = \Phi \left( \dfrac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right); \ \ \text{ if } y_i=c_i \\ f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i); \ \ \text{ if } y_i>c_i \end{cases} \tag{10.13}
f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i) \ = \ \underbrace {\frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)}}_{ f(y_{i} \, | \, y_i > c_{i})} \underbrace {\left [1 - \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right ]}_{P(y_i>c_i)} \ =
\underbrace {\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)}_{f(y_i^*)=N(\mathbf x_i \boldsymbol \beta, \, \sigma^2)} \tag{10.14}
This leads to the joint density of the observed vector \mathbf y:
f(\mathbf y \, | \, \mathbf X,c_i) \ = \ \prod_{y_i=c_i} \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \cdot \prod_{y_i>c_i} \frac{1}{\sigma} \phi \left( \frac {y_i-\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.15}
\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{y_i=c_i} \ln \left[ {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \ + \ \sum_{y_i>c_i}^n \ln \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \ \ \rightarrow \ \ \hat {\boldsymbol \beta}, \hat \sigma^2
Estimation has to be done by numerical optimization methods
After estimation, parameter tests and tests of exclusion restrictions can be carried out as usual with Wald, Likelihood Ratio or Lagrange Multiplier tests
Note that if the distributional assumptions on u_i (normality and homoscedasticity) are violated, the MLE above is inconsistent
Further note that the uncensored data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)
OLS applied to the observed data doesn’t yield consistent estimates for \boldsymbol \beta as the expectation of y_i, conditional on \mathbf x_i and y_i\geq c_i, is not \mathbf x_i \boldsymbol \beta but a highly nonlinear function in \mathbf x_i and \sigma
E(y_i) \ = \ E(y_i|y \geq c_i) \ = \ E(y_i|y_i > c_i) \cdot P(y_i > c_i) \, + \, \underbrace {E(y_i|y_i = c_i)}_{c_i} \cdot P(y_i = c_i)
Plugging in our known results regarding the truncation case E(y_i|y_i > c_i) (Equation 10.8) and for P(y_i = c_i) (Equation 10.12), respectively P(y_i > c_i), we have
E(y_i) \ = \ \underbrace {\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right]}_{E(y_i|y_i > c_i)} \underbrace {\left[ 1 -\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right]}_{P(y_i>c_i) \, = \, [1-P(y_i=c_i)]} \, + \, c_i \, \underbrace {\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right)}_{P(y_i=c_i)} \ =
\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right] \underbrace {\left[ \Phi \left( \frac {\mathbf x_i \boldsymbol \beta - c_i}{\sigma} \right) \right]}_{P(y_i>c_i)} \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \ =
\mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.16}
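Equation 10.16 can be checked numerically: for a fixed value of x, the average of the censored outcome over many draws should match the formula. A small sketch (our own illustration with arbitrary parameter values, not part of the original code):
# numerical check of Eq. 10.16 for one value of x (illustrative sketch)
set.seed(42)
b0 <- 2; b1 <- 1; sig <- 4; cpt <- 1; x0 <- 0.5    # arbitrary parameters and x value
xb <- b0 + b1 * x0

ystar <- xb + rnorm(1e6, sd = sig)                 # latent outcome y*
yc    <- pmax(ystar, cpt)                          # left censoring at cpt

ey <- xb * pnorm((xb - cpt)/sig) + sig * dnorm((xb - cpt)/sig) +
      cpt * pnorm((cpt - xb)/sig)                  # right-hand side of Eq. 10.16
c(simulated = mean(yc), formula = ey)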
# simulated data - example from above continued
set.seed(1234890)
n <- 1500  # fewer observations to better judge efficiency

# true parameters, which we want to estimate
sigma  <- 4
beta_0 <- 2
beta_1 <- 1

x   <- rnorm(n, mean = 0, sd = 2)
eps <- rnorm(n, sd = sigma)

# the true model
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg)  # Truncated Regression package
library(censReg)   # Censored Regression package

## truncated y = yt, censored y = yc, at c = 1
d$yt <- ifelse(d$y > 1, d$y, NA)
d$yc <- ifelse(d$y > 1, d$y, 1)

## compare estimates for full/truncated/censored data
# ols full data
full <- lm(y ~ x, data = d)

# ols truncated data
ols <- lm(yt ~ x, data = d)

# ols censored data
olsc <- lm(yc ~ x, data = d)

# MLE truncated data
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")

# MLE censored data
censored <- censReg(yc ~ x, data = d, left = 1)
summary(censored)
Call:
censReg(formula = yc ~ x, left = 1, data = d)
Observations:
Total Left-censored Uncensored Right-censored
1500 594 906 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 2.08976 0.11893 17.57 <2e-16 ***
x 0.90929 0.05969 15.23 <2e-16 ***
logSigma 1.38363 0.02540 54.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 5 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -2975.228 on 3 Df
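Analogously to the truncated case, the censored log-likelihood can be maximized directly. The sketch below (our own code, not part of the original example) does this with optim() for the simulated censored variable d$yc and should come close to the censReg() estimates above:
# direct MLE of the censored log-likelihood, Eq. 10.15 (illustrative sketch)
negll_cens <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s <- exp(par[3])     # sigma > 0 via exp()
  mu   <- b0 + b1 * d$x
  cens <- d$yc == 1                                # censored observations (y = c = 1)
  ll <- ifelse(cens,
               pnorm(1, mean = mu, sd = s, log.p = TRUE),    # P(y* <= c)
               dnorm(d$yc, mean = mu, sd = s, log = TRUE))   # density of uncensored y
  -sum(ll)
}
res_c <- optim(c(mean(d$yc), 0, log(sd(d$yc))), negll_cens, method = "BFGS")
c(beta_0 = res_c$par[1], beta_1 = res_c$par[2], sigma = exp(res_c$par[3]))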
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc,
"OLS cens data"=olsc, "MLE cens data"=censored),
output="flextable",
gof_map = c("nobs", "aic", "bic", "rmse"),
align = "lddddd",
stars = TRUE,
fmt = 3)
|             | OLS full data | OLS trunc data | MLE trunc data | OLS cens data | MLE cens data |
|-------------|---------------|----------------|----------------|---------------|---------------|
| (Intercept) | 2.039***      | 4.716***       | 2.305***       | 3.352***      | 2.090***      |
|             | (0.104)       | (0.096)        | (0.419)        | (0.071)       | (0.119)       |
| x           | 0.964***      | 0.397***       | 0.827***       | 0.526***      | 0.909***      |
|             | (0.053)       | (0.050)        | (0.116)        | (0.036)       | (0.060)       |
| sigma       |               |                | 3.920***       |               |               |
|             |               |                | (0.201)        |               |               |
| logSigma    |               |                |                |               | 1.384***      |
|             |               |                |                |               | (0.025)       |
| Num.Obs.    | 1500          | 906            | 906            | 1500          | 1500          |
| AIC         | 8439.7        | 4391.6         | 4117.3         | 7274.4        | 5956.5        |
| BIC         | 8455.6        | 4406.0         | 4131.8         | 7290.3        | 5972.4        |
| RMSE        | 4.02          | 2.72           | 3.55           | 2.73          |               |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Strictly speaking, the Tobit model, Tobin (1958), is formally identical to a censored normal regression model. However, there are two distinct features, one minor and one important
In most cases, the Tobit model is formulated with left-censoring and c_i=0. This is not important and can easily be changed
More importantly, the Tobit model is based on an optimization model with corner solutions and not on censoring of data above or below a certain threshold due to certain survey designs
This different starting point has serious consequences for interpreting the estimation results, especially for calculating the relevant partial effects
Let us consider an example for an underlying optimization problem with corner solutions for some agents i which maximize their utility U(y_i, z_i) subject to a budget constraint p_yy_i+p_zz_i \leq I_i
y_i and z_i could be leisure or working time versus consumption of goods, or consumption of tobacco versus other goods, or consumption today versus consumption tomorrow, etc. The critical point is the additional non-negativity constraints; it is not possible to choose negative values, hence y_i≥0 and z_i≥0
Without the non-negativity constraints the solution of this optimization problem for y_i is (assuming a linear solution which is to be expected with certain utility functions)
y_i^* = f(p_y, p_z, I_i, u_i) = \mathbf x_i \boldsymbol \beta + u_i
y_i^* = \mathbf x_i \boldsymbol \beta + u_i \\ y_i = y_i^* \quad \text{if } \ y_i^* > 0 \\ y_i = 0 \quad \ \text{if } \ y_i^* ≤ 0
This has consequences for the interpretation of the model. In the standard censored regression model we are interested in the population parameter \boldsymbol \beta, which, in the case of censored or truncated data, cannot be consistently estimated by OLS. Therefore, we need a maximum likelihood approach
In the Tobit model, with an underlying optimization problem with non-negativity constraints, we are typically interested in how much the value of the actually observed variable y (labor supply in hours or expenditures on cars or tobacco) changes due to changes of the explanatory variables.
In particular, there are actually three different partial effects present in the model
\dfrac {\partial E(y^*|\mathbf x)}{\partial x_j} = \beta_j
\dfrac {\partial E(y|\mathbf x, y>0)}{\partial x_j} = \ ?
\dfrac {\partial E(y|\mathbf x, y≥0)}{\partial x_j} = \dfrac {\partial E(y|\mathbf x)}{\partial x_j}= \ ?
Hence, typically, it is the last partial effect we are interested in. For instance, if the wage or the number of children changes, we want to know the effect on the supplied hours worked, coming from people already employed as well as from people who are not currently working.
\dfrac {\partial P(y_i>0|\mathbf x_i)}{\partial x_j} \, = \, \dfrac {\partial \Phi(\mathbf x_i \boldsymbol \beta/\sigma)}{\partial x_j} \, = \, \dfrac {\beta_j}{\sigma} \phi(\mathbf x_i \boldsymbol \beta/\sigma) \tag{10.17}
We already calculated the expectation E(y_i| \mathbf x_i, y_i≥0), (Equation 10.16). But here, we are dealing with the case of c_i=0, so the third term in Equation 10.16 is zero
E(y_i| \mathbf x_i, y_i≥0) \ = \ E(y_i| \mathbf x_i) \ = \ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \ =
\Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \left[ \mathbf x_i \boldsymbol \beta \, + \, \sigma \, \lambda \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right]
\dfrac {\partial E(y_i|\mathbf x_i, y_i≥0)}{\partial x_j} \ = \ \beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.18}
Hence, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor \Phi(\mathbf x_i \boldsymbol \beta / \sigma), which is clearly positive and less than 1
\widehat{APE}_{j} = \hat\beta_{j} \, \frac{1}{n} \sum_{i=1}^{n} \Phi \left(\frac {\mathbf{x}_{i} \hat{\boldsymbol{\beta}}}{\hat\sigma} \right) \tag{10.19}
\dfrac {\partial }{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right ] \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} - \sigma\left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} - \left (\frac{\mathbf x_i \boldsymbol \beta} {1} \right) \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)
Sometimes, we only want the effect of changing wages on the hours worked chosen by people who are already employed, or the effect on the expenditures on tobacco for people who actually are smokers (the second partial effect mentioned above). In this case we need
E(y_i|\mathbf x_i, y_i>0) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )}
which we already encountered in Equation 10.8, dealing with the truncated normal distribution. Remember, the function \lambda(z) is the inverse Mills ratio (Equation 10.7) and equals \phi(z)/\Phi(z)
\dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ = \beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right] \tag{10.20}
Once again note, that d \Phi(z) / dz = \phi(z), \; d \phi(z) / d z = -z \phi(z) \, and \, \lambda(z)=\phi(z)/\Phi(z)
\text{ } \\ \dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ =
\beta_j + \dfrac{- \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \, \Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) - \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma}\phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } =
\beta_j + \left[ - \dfrac{ \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \frac {\beta_j}{\sigma}\Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) } {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } - \dfrac{ \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } \right] \ =
\beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right]
Now, we estimate a concrete example using the MROZ data, which we have already studied in the context of Logit and Probit models. We are interested in the number of working hours supplied by married women, `hours`, and the effects of several explanatory variables.
These variables are `nwifeinc` = non-wife income, `educ` = education in years, `exper` = experience in years, `age`, `kidslt6` = number of kids younger than 6 and `kidsge6` = number of kids older than 6
We estimate the model with OLS, Tobit and Probit. The reason for the Probit estimation is a special feature of the Tobit model:
The Tobit estimates implicitly incorporate the choice of doing something (for instance, the probability of working, depending on some variables) as well as the response of the intensity of doing something (e.g., the response of working time to education)
These two effects are: {\partial P(y_i>0|\mathbf x_i)}/{\partial x_j} = (\beta_j/\sigma) \phi(\mathbf x_i \boldsymbol \beta/\sigma) (Equation 10.12, Equation 10.17) and {\partial E(y_i|\mathbf x_i, y_i>0)}/{\partial x_j} = \beta_j [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ] (Equation 10.20) and hence, are represented by the same parameter \beta_j times a positive constant
Therefore, every variable in the model affects the participation probability and the intensity equation with the same sign. Further, the Probit parameters should be approximately equal to \hat \beta_j/\hat \sigma in terms of the Tobit estimates. So, it is always a good idea to additionally estimate a corresponding Probit model (coding y_i > 0 as 1) and check whether the estimated coefficients actually have the same signs. If there are some coefficients with statistically significantly different signs, then the Tobit model might not be appropriate
library(boot)
library(wooldridge)
library(marginaleffects)  # for avg_slopes()
data("mroz")

# OLS
outols <- glm(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
              data = mroz, x = TRUE)

# Tobit model - default of `tobit` from `AER` package: left = 0.
# You could also use `censReg` -> same result
outtobit <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                  data = mroz)

# with censReg - same result but logSigma is reported (but avg_slopes does not work)
# outtobit1 <- censReg(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
# you can calculate PEA
# margEff(outtobit1)

########## for APEs for modelsummary()
## bootstrap only necessary for std.errors
bootbdy <- function(data, indices) {
  dat <- data[indices, ]   # resampled data
  out <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               data = dat)
  m_tobit <- avg_slopes(out)
  fac <- mean( pnorm( predict(out)/out$scale ), na.rm = TRUE )  # scaling factor, Eq. 10.19
  m_tobit["estimate"] <- m_tobit["estimate"] * fac
  return( coef(m_tobit) )
}

bootresult_ <- boot(data = mroz, statistic = bootbdy, R = 500)

se_boot <- matrixStats::colSds(bootresult_$t)  # for debugging, should be the same as mm

mm <- summary(bootresult_)  # to calculate std.errors

## to use `modelsummary`, making a data.frame (with naming according to the broom package)
m_tobit1 <- data.frame(term = names(bootresult_$t0), estimate = mm$original,
                       std.error = mm$bootSE, statistic = mm$original/mm$bootSE,
                       p.value = (1 - pnorm( abs(mm$original/mm$bootSE) ))*2 )

## modelsummary needs a special format of a list with tidy and glance
m_tobit <- list( tidy = m_tobit1, glance = data.frame(NA) )
class(m_tobit) <- "modelsummary_list"

##############
# Probit model - to check for misspecification
outprobit <- glm( inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                  data = mroz, family = binomial(link = "probit"))

# Average partial effects for Probit with 'avg_slopes()'
marg_probit <- avg_slopes(outprobit)

# Average partial effects for Tobit with 'avg_slopes()', but without correction factor fac
marg_tobit <- avg_slopes(outtobit)
rows <- tibble::tribble(~term, ~Bivariate, ~Multivariate, ~ab, ~cd, ~ef,
                        'Empty row', '-', '-', '-', '-', '-',
                        'Another empty row', '?', '?', '?', '?', '?')
# attr(rows, 'position') <- c(1, 6)
modelsummary( list("OLS"=outols, "Tobit"=outtobit, "Tobit APE"=m_tobit, "Probit"=outprobit, "Probit APE"=marg_probit),
output="gt",
#statistic = c('std.error', 'statistic', 'p.value'),
#coef_omit = "Inter",
gof_omit = "F|N",
align = "lddddd",
stars = TRUE,
fmt = 3,
add_rows = rows
)
|                   | OLS         | Tobit       | Tobit APE   | Probit    | Probit APE |
|-------------------|-------------|-------------|-------------|-----------|------------|
| (Intercept)       | 1330.482*** | 965.305*    |             | 0.270     |            |
|                   | (270.785)   | (446.436)   |             | (0.508)   |            |
| nwifeinc          | -3.447      | -8.814*     | -5.189*     | -0.012*   | -0.004*    |
|                   | (2.544)     | (4.459)     | (2.593)     | (0.005)   | (0.001)    |
| educ              | 28.761*     | 80.646***   | 47.473***   | 0.131***  | 0.039***   |
|                   | (12.955)    | (21.583)    | (12.649)    | (0.025)   | (0.007)    |
| exper             | 65.673***   | 131.564***  | 54.115***   | 0.123***  | 0.026***   |
|                   | (9.963)     | (17.279)    | (4.945)     | (0.019)   | (0.002)    |
| I(exper^2)        | -0.700*     | -1.864***   |             | -0.002**  |            |
|                   | (0.325)     | (0.538)     |             | (0.001)   |            |
| age               | -30.512***  | -54.405***  | -32.026***  | -0.053*** | -0.016***  |
|                   | (4.364)     | (7.419)     | (4.259)     | (0.008)   | (0.002)    |
| kidslt6           | -442.090*** | -894.022*** | -526.278*** | -0.868*** | -0.261***  |
|                   | (58.847)    | (111.878)   | (73.498)    | (0.118)   | (0.032)    |
| kidsge6           | -32.779     | -16.218     | -9.547      | 0.036     | 0.011      |
|                   | (23.176)    | (38.641)    | (23.421)    | (0.044)   | (0.013)    |
| R2                | 0.266       |             |             |           |            |
| AIC               | 12117.1     | 7656.2      |             | 818.6     | 818.6      |
| BIC               | 12158.7     | 7697.8      |             | 855.6     | 855.6      |
| Log.Lik.          | -6049.534   |             |             | -401.302  | -401.302   |
| RMSE              | 746.18      | 953.68      |             | 0.42      | 0.42       |
| Left_censored     |             | 325         |             |           |            |
| Un_censored       |             | 428         |             |           |            |
| Right_censored    |             | 0           |             |           |            |
| Total             |             | 753         |             |           |            |
| Scale             |             | 1122.02***  |             |           |            |
| Empty row         | -           | -           | -           | -         | -          |
| Another empty row | ?           | ?           | ?           | ?         | ?          |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Working hours of married women, `hours`, and the effects of several explanatory variables, estimated with OLS, Tobit, Tobit APEs (Equation 10.19) and a Probit model for checking the signs of the parameters

#### Code for `texreg` procedure - not reported
library(texreg)

# Calculating APEs for tobit model
# Basically with:
# fav <- mean( pnorm( predict(outtobit)/outtobit$scale ), na.rm = TRUE )
# APE <- fav * coef(outtobit)

source('~/Dropbox/R/User Functions/extract.margins.R')       # user function for texreg
source('~/Dropbox/R/User Functions/extract.tobitMargins.R')  # user function for texreg

# APE for Tobit with bootstrapping for standard errors. User fn tobitAPE is integrated in extract.tobitMargins.R
bootbody <- function(data, indices) {
  out <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               data = mroz[indices, ])
  return( tobitAPE(out) )
}
bootresult <- boot(data = mroz, statistic = bootbody, R = 1000)

# preparing for texreg to know tobit_output (append classes: "tobitMarg", "APE")
tobit_out <- outtobit
class(tobit_out) <- c("tobitMarg", "APE", "tobit", "survreg", "margins")

screenreg( list(outols, outtobit, tobit_out, outprobit, marg_probit),
           custom.model.names = c("OLS", "Tobit", "Tobit APE", "Probit", "Probit APE"),
           digits = 3,
           bootresults = bootresult, discrete = c(9)
)
Because of the different scaling factors involved, Tobit coefficients are not directly comparable to OLS coefficients. For instance, compare the OLS estimate for \beta_{educ}, which is 28.8, with the Tobit estimate of 80.6. The latter, however, is the estimated effect on the latent variable, which is not really relevant in this context
To compare Tobit and OLS, one has to compare average partial effects (APEs), Equation 10.19. It turns out that the partial effects of Tobit and OLS differ in a number of cases, although not in all. For instance, the effect of education on E(y|\mathbf x), the expectation of observed hours worked, is estimated to be 47.5
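The Tobit APE column can be reproduced by hand from the fitted Tobit model, using the average adjustment factor of Equation 10.19 (this mirrors the commented `texreg` helper code above; only the bootstrap for the standard errors is skipped):
# Tobit APEs by hand, Eq. 10.19: average adjustment factor times coefficients
fac <- mean( pnorm( predict(outtobit) / outtobit$scale ), na.rm = TRUE )
round(fac * coef(outtobit), 3)   # e.g. educ should be close to the 47.5 reported above
                                 # (the intercept row has no APE interpretation)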
Another difference between Tobit and OLS is that, due to the linearity of the model, OLS assumes constant partial effects, whereas partial effects are nonconstant in Tobit. To demonstrate this, look at Figure 10.8 below. There, the different predictors for hours worked are displayed, along with the observed points. The most relevant one is the blue line, which represents E(y|\mathbf x). Note how the blue line converges to the red one, the linear predictor \mathbf x \hat {\boldsymbol \beta} of the Tobit model, as the censoring becomes less important. Further note that the OLS predictor approximates E(y|\mathbf x) quite well over some range of \mathbf x
Let us now compare the estimates of the Tobit model with the Probit ones. It turns out that every coefficient has the same sign. Further, dividing the Tobit estimate for \beta_{educ} by the Tobit estimate for \sigma (which is 1122), we get 0.072, compared to 0.13 for Probit. Doing the same for \beta_{kidslt6} we get -0.8, compared to -0.87. This is not a huge difference, but it shows that the initial decision to work is more strongly affected by education or small kids than the decision about the hours worked once in the labor force
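This sign and scale check can be carried out directly with the objects estimated above (outtobit$scale holds \hat\sigma):
# compare Tobit coefficients scaled by sigma-hat with the Probit coefficients
round(cbind(tobit_scaled = coef(outtobit) / outtobit$scale,
            probit       = coef(outprobit)), 3)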
tobitPredict <- function(x, model) {   # calculates pred values for tobit model
  pred <- pnorm( x/model$scale ) * x + model$scale * dnorm( x/model$scale )
  return(pred) }

tobitPredict0 <- function(x, model) {  # calculates Lambda values for tobit model
  pred0 <- x + model$scale * dnorm( x/model$scale ) / pnorm( x/model$scale )
  return(pred0) }

olsout2 <- lm(mroz$hours ~ outtobit$linear.predictors)
plot( outtobit$linear.predictors, mroz$hours, ylim=c(-1000, 5000), pch = 16, cex=0.9,
col=c("#666666", "#666666", "#666666"), xlab="Linear tobit predictor X*beta",
ylab="Hours and predictors of hours",
main = "Different predictors for hours worked" )
abline(c(0,0),c(1,1), col="red", lwd=2.5)
curve(tobitPredict(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE,
type = "l", col="blue", lwd = 4.5)
abline(olsout2, col="purple", lwd =2)
abline(0,0)
curve(tobitPredict0(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE,
type = "l", col="cornflowerblue", lwd = 3)
text(-2500, 640, "E(y|y>0)")
text(-600, -950, "E(y*)")
text(-1100, 280, "E(y|y≥0) = E(y)")
text(-1950, -690, "OLS")
Until now, we dealt with the problem of truncated and censored data, which is directly linked to the dependent variable y and leads to a sample that is no longer random; this can render OLS estimates inconsistent.
Remember, sample selection poses no problem if it is solely linked to exogenous variables or is exogenous to the errors 3
s_i y_i \, = \, s_i \mathbf x_i \boldsymbol \beta + s_i u_i \ \ \Rightarrow
E(s_i y_i|\mathbf x_i, s_i=1) \, = \, E(s_i \mathbf x_i \boldsymbol \beta| \mathbf x_i, s_i=1) + E(s_i u_i|\mathbf x_i, s_i=1) \ \ \Rightarrow
E(y_i|\mathbf x_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \underbrace {E(u_i|\mathbf x_i, s_i=1)}_{\text {should be } 0} \, = \, \mathbf x_i \boldsymbol \beta
So, if the expectation of u_i is not influenced by s, the conditional expectation of y is equal to \mathbf x_i \boldsymbol \beta and thus, \boldsymbol \beta can be consistently estimated by OLS with the available sample
The expectation of u_i is not influenced by s if
s is only a function of the exogenous \mathbf x (exogenous sample selection), in which case the condition E(u_i|\mathbf x_i, s_i=1)=0 reduces to the usual condition E(u_i|\mathbf x_i)=0
s is purely random (random sample selection), in which case the condition E(u_i|\mathbf x_i, s)=0 is met
s is a mixture of both cases above (the short simulation sketch below illustrates these cases)
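A short simulation sketch (our own illustration, not part of the original text) makes these cases concrete: selection on \mathbf x or purely random selection leaves the OLS slope essentially untouched, whereas selection on y (and thus on u) does not:
# exogenous vs. endogenous sample selection (illustrative sketch)
set.seed(1)
n <- 20000
x <- rnorm(n); u <- rnorm(n)
y <- 1 + 2 * x + u

sel_x <- x > 0           # selection depending only on the exogenous regressor
sel_r <- runif(n) > 0.5  # purely random selection
sel_y <- y > 1           # selection depending on y (and hence on u): endogenous

round(c(full   = coef(lm(y ~ x))[2],
        on_x   = coef(lm(y ~ x, subset = sel_x))[2],
        random = coef(lm(y ~ x, subset = sel_r))[2],
        on_y   = coef(lm(y ~ x, subset = sel_y))[2]), 3)
# the first three slope estimates are close to the true value 2;
# selecting on y attenuates the slope, as in the truncation examples above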
Incidental truncation or sample selection is a mechanism where sample selection is only indirectly linked to the dependent variable y or its error u; that means via other variables.
Very often but not always, this is due to self-selection
The leading example is the following: We want to estimate a person's market wage, i.e., the wage she could command in the labor market. This market wage would depend on her education, experience, age, etc. But if the wage offered by the market is too low, meaning that the market wage is lower than her reservation wage, she will not work at all and her potential market wage is not observed
This is clearly a self-selection mechanism, and a first stage decision determines whether the wage for this particular person is observed or not. The sample is no longer random and, in particular, is linked to the unobserved reservation wage
People with low potential market wage or high reservation wage might therefore be heavily underrepresented in the sample
A solution to this problem is to define a separate model describing the decision for working versus non-working to get a clue whether a particular person will probably work and try to make an adjustment for the underrepresented cases
Hence, we will have two models, a selection or participation model and the outcome model (or sometimes called intensity model) which we are primarily interested in
The selection model (participation model) can be modeled as a simple Probit model; we want to estimate the conditions under which a person is expected to work (or, more generally, to participate). This is the case when the probability of working is larger than that of not working. Naturally, this leads to a Probit model:
s_i^*=\mathbf z_i \boldsymbol \gamma + e_i, \ \ \ \ s_i = \begin{cases} 1 \ \text{ if } \ s_i^*>0 \\ 0 \ \text{ if } \ s_i^* ≤ 0 \end{cases} \tag{10.21}
Here, s_i^* is a latent variable, which in our example could be the difference between potential market wage and reservation wage
We predict labor market participation, s_i=1, if s_i^*>0, i.e., the market wage is larger than the reservation wage. Clearly, we predict out of labor force, s_i=0, if s_i^*≤0. In particular, s_i=1 if
e_i > - \mathbf z_i \boldsymbol \gamma \tag{10.22}
The equation of primary interest is the outcome equation (sometimes called intensity equation), in our example the wage equation:
y_i^* = \mathbf x_i \boldsymbol \beta + u_i, \ \ \ y_i = \begin{cases} y_i^* \ \text{ if } \ s_i=1 \\ \text{unobserved if } \ s_i = 0 \end{cases} \tag{10.23}
Here, y_i^* is the latent dependent variable, in our example the potential market wage, \mathbf x_i are the explanatory variables and u_i represents unobserved heterogeneity of the individuals which too affects the market wage
The observation rule says that y_i is only observed if s_i=1, i.e. s_i^*>0, which means participation in the labor market in our example. The value of s_i is determined in the first stage decision problem, the selection equation
Note, we assume that \mathbf z_i is observed for all i in the sample, as otherwise we couldn’t estimate the selection problem
According to the selection model (Equation 10.21), besides \mathbf z_i, the unobserved error e_i determines whether the market wage for person i is observed or not. Therefore, we have an incidental sample selection because these factors are not directly linked to y_i or u_i
We assume that the errors from the selection and outcome equations, e and u, are jointly normally distributed with correlation \rho. This means that the joint distribution is solely described by the mean vector and the covariance matrix and thus rules out any nonlinear relationships between e and u:
\binom{e}{u} \ \sim \ N\left ( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \, \sigma_u \\ \rho \, \sigma_u & \sigma_u^2 \end{bmatrix} \right ) \tag{10.24}
Note, \sigma_e^2 = 1 is no restriction of generality as this variance is not identified in a Probit model and thus can be arbitrarily normalized (we already discussed this point in the chapter on binary choice models)
We assume, that e_i as well as u_i are independent of (\mathbf x_i, \mathbf z_i), thus E(u_i | \mathbf x_i, \mathbf z_i)=0 and E(e_i | \mathbf x_i, \mathbf z_i)=0, MLR.4 (see Section 2.7)
Whether or not OLS applied to the wage equation leads to consistent and asymptotically normally distributed estimates of \boldsymbol \beta depends on whether the selection mechanism via e_i is exogenous (uncorrelated) with respect to the error term u_i (and thus with y_i)
Therefore, the correlation coefficient \rho is crucial for this problem. If \rho \neq 0, a sample selection bias arises
To show the importance of \rho we decompose u_i into its conditional expectation on e_i and an error term \epsilon_i, which is uncorrelated with e_i by construction and uncorrelated with \mathbf x_i, \mathbf z_i by assumption (as u_i is independent of (\mathbf x_i, \mathbf z_i))
u_i \, = \, E(u_i|e_i) + \epsilon_i, \quad \text{with } \, E(\epsilon_i|\mathbf x_i, \mathbf z_i, e_i)=0
Because (u_i,e_i) are jointly normal, E(u_i|e_i) is equal to the linear projection (regression) of u_i on e_i, with the projection coefficient \sigma_{u,e}/\sigma_e^2 (with \sigma_{u,e} = covariance of u and e)
u_i \, = \, \sigma_{u,e}e_i + \epsilon_i \, = \, \rho \, \sigma_u e_i + \epsilon_i
y_i^* = \mathbf x_i \boldsymbol \beta + \rho \, \sigma_u e_i + \epsilon_i \tag{10.25}
E(y_i^*|\mathbf x_i, \mathbf z_i, s_i=1) \ = \ \mathbf x_i \boldsymbol \beta \, + \, \rho \sigma_u E(e_i|s_i=1) \, + \, \underbrace {E(\epsilon_i|\mathbf x_i, \mathbf z_i, \overbrace {s_i=1}^{e_i>-\mathbf z_i \boldsymbol \gamma})}_{= \,0, \text{ as } \epsilon_i \, \perp \, (\mathbf x_i, \mathbf z_i,e_i) }
According to the selection model (Equation 10.21), the term before the last is:
\rho \sigma_u E(e_i|s_i=1) \, = \, \rho \sigma_u E(e_i|s_i^*>0) \, = \, \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma)
But this is the expectation of a truncated standardized normal distribution and hence, we can apply the theorem about truncated normal distributions, Theorem 10.1 (with y=e, \mu=0, c=-\mathbf z_i \boldsymbol \gamma and \sigma_{(e)}=1)
\rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma) \, = \, \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma)
with \lambda(\mathbf z_i \boldsymbol \gamma) being the inverse Mills Ratio (see definition in Equation 10.7)
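This step is easy to verify numerically for a single value of \mathbf z_i \boldsymbol \gamma (our own short sketch, not part of the original code):
# check E(e | e > -z*gamma) = lambda(z*gamma) for one value of z*gamma (sketch)
set.seed(7)
zg <- 0.4                                  # some arbitrary value of z_i * gamma
e  <- rnorm(1e6)
c(simulated = mean(e[e > -zg]), lambda = dnorm(zg) / pnorm(zg))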
Putting things together, we finally arrive to the conditional expectation of observed y_i, which is the population regression in this case:
E(y_i|\mathbf x_i, \mathbf z_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma) \tag{10.26}
Regarding our leading example, if the correlation of unobserved wage determinants and unobserved determinants of the work decision is positive (\rho>0), the women observed working will have higher wages than expected solely from their characteristics (= positive selection). And this effect is represented by \lambda(\mathbf z_i \boldsymbol \gamma), which is strictly positive
As Equation 10.26 shows, if \rho \neq 0, regressing y_i solely on \mathbf x_i leads to a sample selection bias in the form of an omitted variable bias; the omitted variable \lambda(\mathbf z_i \boldsymbol \gamma ) is a nonlinear function of \mathbf z_i and therefore probably correlated with \mathbf x_i (as \mathbf x_i is usually a part of \mathbf z_i)
Finally, let us discuss another aspect more closely by means of our example. We argued that s_i^* in the selection equation could be interpreted as the difference between market wage offers and the reservation wage, (w_i - w_i^r). But according to the outcome Equation 10.23, the market wage offers depend on \mathbf x_i and on u_i. This implies two things
Thus, the variables in the outcome equation, \mathbf x_i, obviously play some role in the selection equation (via market wage offers). Therefore, the variables \mathbf x_i should be part of the variables in \mathbf z_i, the variables of the selection equation. And this inclusion of all \mathbf x in \mathbf z is generally recommended as standard procedure
A very similar argument can be put forward for the unobserved factors of the outcome equation, u_i. These too influence the market wage offers and therefore the decision to work or not. Thus, these factors are also a part of e_i, the unobserved factors of the selection equation. Hence, the error terms u_i and e_i are expected to be positively correlated in our example, which implies \rho > 0
The model introduced above is called the Heckman model or Heckit, Heckman (1979). It is a generalization of the Tobit model. In fact, if you set \mathbf x_i = \mathbf z_i, u_i = e_i and censor y_i to zero for the cases where y_i is not observed, we arrive at the Tobit model. For this reason the Heckman model is also sometimes called the Tobit II model.
There are two procedures to estimate the model, a 2-step procedure and maximum likelihood estimation. Surprisingly, the 2-step procedure is much more common in practice. This procedure is as follows
First, estimate the selection equation with a Probit model, using all n observations, to obtain \hat{\boldsymbol \gamma}:
P(s_i=1|\mathbf z_i) = \Phi(\mathbf z_i \boldsymbol \gamma) \ \rightarrow \ \hat {\boldsymbol \gamma}
Then, compute the estimated inverse Mills Ratio for every observation:
\hat\lambda({\mathbf z}_i \hat{\boldsymbol \gamma}) = \dfrac {\phi({\mathbf z}_i \hat{\boldsymbol \gamma})}{\Phi({\mathbf z}_i \hat{\boldsymbol \gamma})}
Finally, estimate the outcome equation by OLS on the selected sample (s_i=1), including \hat\lambda as an additional regressor:
y_i \, = \, \mathbf x_i \boldsymbol \beta \, + \, \alpha \, \hat\lambda(\mathbf z_i \hat{\boldsymbol \gamma}) \, + \, error \tag{10.27}
The whole procedure sounds a little bit complicated, but there are standard routines for both STATA and R. The corresponding R package is `sampleSelection` with the procedure `heckit()`.
These routines also calculate the correct standard errors for the 2-step procedure.
The estimated coefficient \hat \alpha of \hat \lambda in Equation 10.27 can be used to test for a sample selection bias. If this parameter is not statistically significantly different from zero, then there seems to be no sample selection bias, as \hat \alpha is an estimate for \rho \sigma_u and hence \hat \alpha = 0 \Rightarrow \rho = 0
However, this whole 2-step procedure and testing for a sample selection bias should be carried out with great care. The reason is that there is an inherent identification problem; the variable \lambda is a nonlinear function (but over a wide range a nearly linear function) of the variables in \mathbf z. As we learned above, \mathbf x is generally contained in \mathbf z. Thus, we often have a severe multicollinearity problem in the estimated outcome Equation 10.27, leading to very imprecise parameter estimates and powerless tests
To reduce the identification problem, we always should have variables in \mathbf z, which are not contained in \mathbf x, helping to identify the coefficients of \mathbf x in Equation 10.27 – in this case, \hat \lambda is not any more a nearly linear combination of the variables in \mathbf x
In practice, it is often hard to find variables that play a role in the selection model but credibly do not play a role in the outcome model. Erroneously omitting variables from the outcome model can lead to severe biases
We argued, that \mathbf z should contain at least one variable which is not contained in \mathbf x. Let us discuss this aspect by means of our example, which is estimated below.
We are interested in the market wage offers of married women. We assume that education and experience play an important role in this respect. So, this is our outcome equation
To avoid a sample selection problem (because we only observe wages of employed women), we apply Heckman's 2-step procedure. Thus, we additionally specify a selection equation for the decision to work, to estimate the correction factor \lambda. We argued above that all variables of the outcome equation should be included in the selection equation, because possible wage offers are clearly an important factor for the decision to work as well
Therefore, the selection model contains at least education and experience as explanatory variables. However, to better identify the parameters of the outcome equation, we additionally need variables that are important for the decision to work and do not play a role for the market wage offers
library(sampleSelection)
library(wooldridge)
data("mroz")
# OLS estimation
ols <- lm(log(wage) ~ educ + exper + I(exper^2), data = mroz)

# Heckit 2 Step model (default); the first line is the selection equation,
# the second line the outcome equation
heck <- heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               log(wage) ~ educ + exper + I(exper^2),
               data = mroz)
summary(heck)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
15 free parameters (df = 739)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.270077 0.508593 0.531 0.59556
nwifeinc -0.012024 0.004840 -2.484 0.01320 *
educ 0.130905 0.025254 5.183 2.81e-07 ***
exper 0.123348 0.018716 6.590 8.34e-11 ***
I(exper^2) -0.001887 0.000600 -3.145 0.00173 **
age -0.052853 0.008477 -6.235 7.61e-10 ***
kidslt6 -0.868328 0.118522 -7.326 6.21e-13 ***
kidsge6 0.036005 0.043477 0.828 0.40786
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5781032 0.3050062 -1.895 0.05843 .
educ 0.1090655 0.0155230 7.026 4.83e-12 ***
exper 0.0438873 0.0162611 2.699 0.00712 **
I(exper^2) -0.0008591 0.0004389 -1.957 0.05068 .
Multiple R-Squared:0.1569, Adjusted R-Squared:0.149
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.03226 0.13362 0.241 0.809
sigma 0.66363 NA NA NA
rho 0.04861 NA NA NA
--------------------------------------------
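To see what heckit() does internally, the two steps can also be carried out by hand. The sketch below (our own code, not part of the original example) reproduces the 2-step point estimates; note that the plain OLS standard errors of the second step are not corrected for the estimated \hat\lambda, which heckit() does automatically:
# Heckman 2-step by hand (point estimates only)
# Step 1: Probit selection equation and inverse Mills Ratio
probit1 <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
               data = mroz, family = binomial(link = "probit"))
zg <- predict(probit1, type = "link")      # z_i * gamma-hat
mroz$imr <- dnorm(zg) / pnorm(zg)          # lambda-hat, Eq. 10.7

# Step 2: outcome equation by OLS with lambda-hat as additional regressor
# (only the selected sample enters, as wage is missing for non-participants)
step2 <- lm(log(wage) ~ educ + exper + I(exper^2) + imr, data = mroz)
coef(step2)   # should match the 2-step heckit() output above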
# Heckit MLE model
heckml <- heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                 log(wage) ~ educ + exper + I(exper^2),
                 data = mroz, method = "ML")

library(texreg)
screenreg( list(ols, heck, heckml), digits = 4,
           custom.model.names = c("OLS", "Heckit 2Stp", "Heckit MLE"),
           prefix = FALSE, include.selection = FALSE, include.rsquared = FALSE )
=========================================================
OLS Heckit 2Stp Heckit MLE
---------------------------------------------------------
(Intercept) -0.5220 ** -0.5781 -0.5527 *
(0.1986) (0.3050) (0.2604)
educ 0.1075 *** 0.1091 *** 0.1084 ***
(0.0141) (0.0155) (0.0149)
exper 0.0416 ** 0.0439 ** 0.0428 **
(0.0132) (0.0163) (0.0149)
exper^2 -0.0008 * -0.0009 -0.0008 *
(0.0004) (0.0004) (0.0004)
invMillsRatio 0.0323
(0.1336)
sigma 0.6636 0.6634 ***
(0.0227)
rho 0.0486 0.0266
(0.1471)
---------------------------------------------------------
Adj. R^2 0.1509 0.1490
Num. obs. 428 753 753
Censored 325 325
Observed 428 428
AIC 1693.7702
BIC 1758.5071
Log Likelihood -832.8851
=========================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
In our example, the coefficient of \lambda, the inverse Mills Ratio, is not significant. Thus, there is no significant sample selection bias. This is the reason why the estimates of OLS and Heckit are very similar
In the table, the results of a maximum likelihood estimation are also reported. As mentioned above, the 2-step procedure is much more common in empirical work, although the MLE is more efficient, especially with regard to an estimate of \rho. The downside of MLE is that it is said to be more prone to specification errors, for instance with regard to the normal distribution assumption of the error terms
The likelihood function of the Heckman model consists of two parts: P(s_i=0), this is the Probit part, and the density of y_i, given s_i=1 times P(s_i=1)
\ln L = \sum_{s_i=0} \ln \left[ 1 - \Phi(\mathbf z_i \boldsymbol \gamma) \right] + \sum_{s_i=1} \ln \left[ f(y_i|s_i=1) \, \Phi(\mathbf z_i \boldsymbol \gamma) \right]
As we have an incidental truncation, f(y_i|s_i=1) = f(y_i|e_i>-\mathbf z_i \boldsymbol \gamma) is a more complicated term based on a bivariate truncated normal distribution (see Green (2017), chapt.19, p. 995)
Last but not least, it should be noted that many of the methods for limited dependent variables and sample selection discussed here can be extended to 2SLS and some also to panel data (for details see Green (2017))
For a truncation from the right (from above) we have: \lambda(x) =\dfrac {-\phi(x)}{\Phi(-x)}.↩︎
Hint: Let \alpha = (\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma}). Then \lambda'(\alpha)=\frac {-\alpha \phi(\alpha) \Phi(\alpha) - \phi(\alpha)^2}{\Phi(\alpha)^2} = -\alpha \lambda(\alpha) - \lambda(\alpha)^2 = (-) and one can show that: -1<\lambda'(\alpha)<0.↩︎
The following results carry over to IV-estimations as well, if E(u|\mathbf z, s)=0, with \mathbf z being the (external + internal) instruments.↩︎