Limited dependent variables (LDV) are variables which are substantively restricted in their domain, for instance:
- Binary variables, e.g., employed/not employed
- Non-negative variables, e.g., wages, prices, interest rates
- Non-negative variables with excess zeros, e.g., labor supply
- Count variables, e.g., the number of arrests in a year
- Censored variables, e.g., unemployment duration
We further have sample selection models:
- The sample used to infer population relationships is endogenously selected, e.g., a wage offer regression in which wages are only observed for people who are actually working
- The resulting sample is no longer random, which can lead to biased estimates
10.1 Binary dependent variable
10.1.1 The linear probability model (LPM)
Until now, we only analysed models with dummy variables on the right hand side – as explanatory variables.
Now, we analyze models with a binary dependent variable, i.e., a dummy variable as left hand side variable.
- For instance, y may represent labor force participation of married women, which is one if the woman is in the labor force and zero if not. If we assume a linear model we have
y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u
Under MLR.4, we have as usual (population regression line):
\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta
As the dependent variable only takes on the values 0 and 1 we additionally have in this case:
E(y \mid \mathbf x) \, = \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x) \, = \, P(y=1 \mid \mathbf x)
This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.
- The coefficient \beta_j describes the partial effect of the explanatory variable j on the probability that y=1
\beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}
- This model was already studied in some detail in Section 5.8; see especially the estimated example there
Code
# We calculate squared residuals and fitted values for plotting
ressq <- outm$residuals^2
yhat <- outm$fitted.values
Measures of fit for binary dependent variables
- A popular measure of fit for such models is the hit rate, which measures the proportion of correctly predicted observations of the dependent variable (which contains only zeros and ones)
- Thereby, whenever \hat y_i (which is the estimated probability that y_i=1) is larger than the cutoff 0.5, the model predicts y_i=1, and when \hat y_i <0.5, the model predicts y_i=0
- Besides the overall hit rate, one can examine the hit rates separately for observations with y_i=1 and y_i=0
- The proportion of correctly predicted y_i's when y_i=1 is called sensitivity. This is the ability of the model to detect true “positives” (e.g., the percentage of people predicted to be in the workforce who actually are or, as an example from medicine, the percentage of sick people who are correctly identified by a tumor marker as having the condition)
- The proportion of correctly predicted y_i's when y_i=0 is called specificity. This is the ability of the model to detect true “negatives” (e.g., the percentage of people predicted not to be in the workforce who actually are not or, as an example from medicine, the percentage of healthy people who are correctly identified by a tumor marker as not having the condition)
- Below, we calculate these hit rates for the LPM and, for comparison, for a corresponding Probit model (discussed below). The results are nearly identical (the code for the hit-rate function is listed below and can be reused)
Code
hitrate <- function(outm, cutoff=0.5, stragaz=TRUE) {
# This function calculates overall hit rates, sensitivity and specificity for
# LPM, Logit and Probit models
# outm = output list of `lm()` or `glm()`
# cutoff = threshold value for predicting 1, default = 0.5
#requires package stargazer
# load dependent variable of model
y <- outm$model[[1]]
# load fitted values
yhat <- outm$fitted.values
ly <- length(y)
# prediction of one, if yhat > threshold
ypred <- (yhat > cutoff)*1
# y which are 1
y_1 <- y[y==1]
ly_1 <- length(y_1)
# y which are zero
y_0 <- y[y==0]
ly_0 <- length(y_0)
# prediction error
prederror <- y - ypred
# overall hits
hit_o <- rep(0, ly)
hit_o <- ifelse( (ypred == 1 & y == 1) | (ypred == 0 & y == 0) ,1, 0)
hitrate_o <- sum(hit_o)/ly
# correct prediction if y=1
hit_1 <- rep(0, ly)
hit_1 <- ifelse((ypred == 1 & y == 1), 1, 0)
hitrate_1 <- sum(hit_1)/ly_1
# correct prediction if y=0
hit_0 <- rep(0, ly)
hit_0 <- ifelse((ypred == 0 & y == 0), 1, 0)
hitrate_0 <- sum(hit_0)/ly_0
# storing results in data frame
results <- data.frame(HitRate = hitrate_o, Sensitivity = hitrate_1, Specificity = hitrate_0,
Nof1s = ly_1, Nof0s = ly_0, check.rows=TRUE)
if (stragaz==TRUE) {
stargazer::stargazer(results, type = "text", summary = FALSE, rownames = FALSE)
}
return(results)
}
# Hit rate of LPM
hr_lpm <- hitrate(outm)
===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734 0.818 0.625 428 325
-------------------------------------------
# For comparison, estimate a corresponding Probit model
# which we introduce below
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
data = mroz, family = binomial(link = "probit"))
# Hit rate of Probit model
hr_probit <- hitrate(outprobit)
===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734 0.813 0.631 428 325
-------------------------------------------
10.1.2 Probit and Logit models
The main problems of the Linear Probability Model (due to the linearity in \mathbf x) are
- Predictions sometimes lie outside the unit interval
- Partial effects of explanatory variables are constant
An obvious solution for these problems is a nonlinear model for binary response
- We assume that the expected response – the response probability – is a nonlinear continuous function G of the explanatory variables, with G'>0, G(z) \rightarrow 0 for z \rightarrow -\infty and G(z) \rightarrow 1 for z \rightarrow \infty
E(y|\mathbf x) \, =\, P(y=1 | \mathbf{x}) \, = \, G\left(\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}\right) \, = \, G(\mathbf{x} \boldsymbol{\beta})
G(\cdot) is the (inverse) link function which links the linear index \mathbf x \boldsymbol \beta to the probability that y_i = 1
Obvious candidates for the function G(\cdot) are cumulative distribution functions (cdf), which are increasing and map (-\infty, +\infty) into the unit interval [0,1]
The estimation model for the observable variable y therefore is
y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u
The error term u of this model has the same properties as the one in the LPM, especially the errors are heteroscedastic with \operatorname{Var}(u_i | \mathbf x) = G(\mathbf x \boldsymbol \beta) (1 - G(\mathbf x \boldsymbol \beta)), (see Section 5.8)
Conditional on \mathbf x, the observations y_i are Bernoulli distributed
- Regarding the choice of the link function, there are two main models
- Probit model with the (inverse) link function (with \phi(z) being the probability density function and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution, see Equation A.3 and Equation A.4)
G(z) \ = \ \Phi(z) := \, \int_{-\infty}^z \phi(v)dv \ = \, \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ \ \quad \text{ :cdf of standard normal distribution}
- Logit model with the (inverse) link function:
G(z) \ = \ \Lambda(z) := \, \dfrac {e^z}{1 + e^z} \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \ \ \ \quad \text{ :cdf of standard Logistic distribution}
Both distributions are symmetric and have quite similar shapes. However, the Logistic distribution has fatter tails – it is nearly identical to a t(8)-distribution; therefore, outliers are somewhat more probable, rendering the Logit estimates a little more robust against outliers
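To visualize this comparison, a small plotting sketch (the factor 1.6 used to rescale the logistic argument is simply the rough conversion between Logit and Probit coefficients mentioned in the example below):
Code
# Compare the standard normal cdf with a (rescaled) logistic cdf
z <- seq(-4, 4, by = 0.01)
plot(z, pnorm(z), type = "l", ylab = "G(z)", main = "Probit vs. rescaled Logit link")
lines(z, plogis(1.6 * z), lty = 2, col = "darkred")  # Lambda(1.6 z) is close to Phi(z)
legend("topleft", legend = c("Phi(z)", "Lambda(1.6 z)"),
       lty = c(1, 2), col = c("black", "darkred"), bty = "n")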
Latent variable interpretation
- Both the Logit and Probit model have a latent variable formulation/interpretation
y_i^* = \mathbf x_i \boldsymbol \beta + e_i, \quad \text{ and } \ \ y_i = \begin{cases} 1 \ \text{ if } \ y_i^*>0 \\ 0 \ \text{ if } \ y_i^*\leq 0 \end{cases}
with e_i \sim N(0,1) for the Probit model and e_i \sim \text{Logistic}(0,\pi^2/3) for the Logit model
- If the latent variable y^* is larger than zero, the observable y takes on the value 1; if it is less than or equal to zero, y takes on 0 (y^* can thus be interpreted as the propensity to have y = 1)
- The latent variable is often interpreted as the net utility of doing something – Random Utility models. If this net utility is higher than some threshold (0 in our case), the utility is higher than the opportunity costs (the utility of the other alternatives) and the action is taken, i.e., y=1
- Note that there is rarely a well-defined unit of measurement for y^*. Therefore, identification requires a scaling, which is done by setting the variance of e to one (Probit), respectively to \pi^2/3 (Logit)
- We do not observe the utility of the action – y^* is a latent variable – we only observe whether the action is taken or not, y=1 or y=0, i.e., whether y^* > 0 or y^* \leq 0. In the following it is shown that this model of discrete choice leads to our estimation model
- From the latent variable model it follows:
E(y \,| \, \mathbf x) \ = \ P(y = 1\,| \,\mathbf x) \ = \ P(y^* > 0 \,| \,\mathbf x) \ = \ P(\mathbf x \boldsymbol \beta + e > 0\,| \,\mathbf x) \ = \ P(e > -\mathbf x \boldsymbol \beta \,| \,\mathbf x) \ \ =
1 - G(-\mathbf x \boldsymbol \beta) \ \ = \text { (by symmetry) } \
G(\mathbf x \boldsymbol \beta) \ = \begin{cases} \Phi(\mathbf x \boldsymbol \beta) \ \text{ :Probit model } \\ \Lambda(\mathbf x \boldsymbol \beta) \ \text{ :Logit model } \end{cases}
- Hence, the latent variable model serves as a theoretical foundation of our nonlinear estimation model, postulated at the beginning of this section. We have shown that:
E(y \,| \, \mathbf x) \, = \, G(\mathbf x \boldsymbol \beta)
- This leads to the estimation model
y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u
- Note that neither the threshold of 0 nor \operatorname{Var}(e) = 1 (or \pi^2/3) in the latent variable model is a restriction of generality. The former is absorbed by the intercept, the latter is only a scaling factor for the coefficients. So neither the threshold nor \operatorname{Var}(e) is of particular interest; they are not identified and hence free to choose
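To illustrate the latent variable formulation, here is a minimal simulation sketch with made-up coefficients: the latent propensity y^* is generated with standard normal errors, only its sign is observed, and a Probit regression recovers the coefficients up to sampling error.
Code
# Simulate the latent variable model y* = -0.5 + 1*x + e,  e ~ N(0,1),  y = 1(y* > 0)
set.seed(1)
n     <- 5000
x     <- rnorm(n)
ystar <- -0.5 + 1 * x + rnorm(n)   # latent propensity (not observable)
y     <- as.numeric(ystar > 0)     # only the sign is observed
# Probit MLE recovers approximately the true coefficients -0.5 and 1
coef(glm(y ~ x, family = binomial(link = "probit")))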
10.1.3 The Gumbel model
Sometimes, a non-symmetric distribution seems more appropriate, for instance if the numbers of ones (successes) and zeros (failures) in the sample are very different. In this case the Gumbel distribution (an extreme value distribution) can be used as an interesting alternative link function
G(z) \ =\, {e^{-e^{-z}}} \ = \ {e^{-e^{-\mathbf x \boldsymbol \beta}}} \ \ \ \ \quad \text{ :cdf of standard Gumbel distribution}
This is a right-skewed distribution, which might be appropriate if the ones are the rare (extreme) outcomes. If the zeros are the rare outcomes, one could use a left-skewed distribution, the (mirrored) Complementary Gumbel distribution:
1-G(-z) \ =\, 1-e^{-e^{z}} \ = \ 1-e^{-e^{\mathbf x \boldsymbol \beta}} \ \ \ \ \quad \text{ :cdf of Complementary Gumbel distribution}
Interestingly, only this complementary version (the "cloglog" link) is implemented in the R procedure glm(). So, if the ones are the rare outcomes in your sample and you want the right-skewed standard Gumbel version, you have to recode y with ifelse(dataset$varname_of_y == 1, 0, 1), turning ones into zeros and zeros into ones (see the sketch below)
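As an illustration, a minimal sketch (reusing the mroz data from the hit-rate comparison above, with a shortened set of regressors chosen only for brevity) of how the two variants can be estimated with glm():
Code
# Complementary Gumbel (left skewed): the built-in "cloglog" link
library(wooldridge)
data("mroz")
outcgumbel <- glm(inlf ~ nwifeinc + educ + kidslt6, data = mroz,
                  family = binomial(link = "cloglog"))
# Standard Gumbel (right skewed): recode the dependent variable first;
# note that the coefficients then switch sign relative to the Gumbel parametrization
mroz$notinlf <- ifelse(mroz$inlf == 1, 0, 1)
outgumbel <- glm(notinlf ~ nwifeinc + educ + kidslt6, data = mroz,
                 family = binomial(link = "cloglog"))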
10.1.4 LPM versus Probit
To show graphically the difference between the LPM and the Probit (or Logit) model, we estimate the probability of denial of a bank credit (deny) as a function of the ratio of monthly debt payments to monthly income (pirat)
library(visreg)       # for plotting partial effects
library(wooldridge)
library(AER)          # the HMDA data set (mortgage application data) ships with the AER package
data("HMDA")
# convert `deny` to numeric; it is coded as a factor variable in the data set
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimating linear probability model
denylpm <- lm(deny ~ pirat, data = HMDA)
To estimate the Probit model, we use the glm() function in R – glm stands for generalized linear models. There are two additional options to set in this function: family, for the distribution of y, which is binomial in our case (binary dependent variable), and the link function, which has to be set to "probit" for G(\mathbf x \boldsymbol \beta) = \Phi(\mathbf x \boldsymbol \beta)
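The Probit fit denyprobit is used in the plotting code below, but its estimation is not shown above; a minimal sketch consistent with the text is:
Code
# Probit model for the probability of mortgage denial (used as `denyprobit` below)
denyprobit <- glm(deny ~ pirat, data = HMDA, family = binomial(link = "probit"))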
Code
visreg(denylpm, band = FALSE, ylim=c(-0.05, 1.05), main = "LPM", xlab = "P/I ratio")
abline(0,0, lty = 2, col = "darkred"); abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied"); text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)
Code
visreg(denyprobit, scale = "response", band = FALSE, ylim=c(-0.05, 1.05), main = "Probit", xlab = "P/I ratio", rug=0)
abline(0,0, lty = 2, col = "darkred"); abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied"); text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)
10.1.5 Partial Effects
The two previous figures show that the predicted probabilities of the Probit model always lie within the unit interval, contrary to the LPM
- Furthermore, the partial effects of the explanatory variables are no longer constant (as in the LPM), but are nonlinear and depend on the values of \mathbf x
- On the one hand, this is a big theoretical advantage, as constant (linear) partial effects are not really appropriate when predicting probabilities
- On the other hand, this is a practical disadvantage, as the estimated coefficients \hat\beta are no longer easily interpretable. Basically, you have to rely on average partial effects, APE
The partial effect of the explanatory variable x_j on the probability that y=1 is
\dfrac{\partial P(y=1| \mathbf x)}{\partial x_j} \ = \ \dfrac {\partial G(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \begin{cases} \dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j \ \text{ : Probit } \\ \dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda(\mathbf x \boldsymbol \beta) (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j \ \text{ : Logit } \end{cases}
- For the Probit case, with \phi being the density function of the standard normal distribution, we have (by the chain rule)
\dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \, \dfrac {\partial}{\partial x_j} \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j
- For the Logit case, applying the quotient and chain rules, we have
\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \dfrac {\partial}{\partial x_j} \left( \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \right) \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{ (1 + e^{\mathbf x \boldsymbol \beta})^2} \, \beta_j
After some manipulations, we eventually get
\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda'(\mathbf x \boldsymbol \beta) \, \beta_j \ = \ \Lambda(\mathbf x \boldsymbol \beta) \, (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j
In both cases the partial effect of x_j on the P(y=1 | \mathbf x) is not equal to \beta_j (which is partial effect on the latent variable y^*), but \beta_j must be multiplied by the corresponding density function, evaluated at \, \mathbf x \boldsymbol \beta, i.e., at particular values of \mathbf x
- Hence, the partial effects are nonlinear and depend on the level of \mathbf x
- Remark: The formulas for the partial effects just derived are only valid for continuous or nearly continuous variables
- One can calculate APE (average partial effects), for instance with the R package marginaleffects, or PEA (partial effects at average)
- The former is the partial effect calculated for each observation \mathbf x_i and then averaged. For the Probit model this is
\widehat{A P E}_{j} \ = \ \left( \frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i} \hat{\boldsymbol{\beta}}\right) \right) \hat\beta_{j}
- The latter is calculated at the mean of \mathbf x, \bar {\mathbf x}. For the Logit model this is
\widehat{P E A}_{j} \ = \ \Lambda' (\bar{\mathbf{x}} \hat{\boldsymbol{\beta}}) \, \hat\beta_{j}
PEAs could be somewhat problematic in the case of dummy variables. For instance, if you have a dummy variable for gender in your model, averaging to 0.5 might be questionable
For a discrete explanatory variable x_k, e.g., the number of years of education or gender, the APE can directly be computed (for the Logit model replace \Phi by \Lambda)
\widehat{A P E}_{k} \ = \ n^{-1} \sum_{i=1}^n \left( \Phi\left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k}\left(c_{ik}+1\right) \right] \ - \ \Phi \left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k} c_{ik} \right] \right)
The R package marginaleffects applies this formula automatically to factor variables (see the sketch below)
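For continuous regressors, the APE formula above can also be computed by hand. A minimal sketch, assuming the Probit fit outprobit from the hit-rate comparison above (educ is chosen because it enters the index linearly):
Code
# "Manual" APE of educ in the Probit model, using the formula above
X    <- model.matrix(outprobit)          # regressor matrix including the intercept
bhat <- coef(outprobit)
ape_educ <- mean(dnorm(X %*% bhat)) * bhat["educ"]
ape_educ                                 # close to the value reported by avg_slopes(outprobit)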
10.1.6 Estimation
As both models are highly nonlinear in the parameters, OLS cannot be applied. In principle, NLS (nonlinear least squares) could be applied, but because of the heteroscedasticity of the error terms this would not be efficient
- Maximum Likelihood estimation (MLE) is the main procedure
- Thereby, we are looking for coefficients \boldsymbol \beta such that the observed sample has the highest probability
- Each observation i is treated as a single and independent draw from a Bernoulli distribution (binomial with one draw)
- We have a success probability for observation i (remember: G=\Phi for Probit and G=\Lambda for Logit)
P(y_i=1 \, | \, \mathbf x_i; \boldsymbol \beta) = G(\mathbf x_i \boldsymbol \beta)
and the non-success probability for observation i
P(y_i=0 \, | \, \mathbf x_i; \boldsymbol \beta) = 1-G(\mathbf x_i \boldsymbol \beta)
This leads to the joint probability function (for all observations)
P(y_1, \ldots, y_n \, | \, \mathbf x; \boldsymbol \beta) \ = \ \underset {y=0} {\prod_{i=1}^n} \left[ 1-G(\mathbf x_i \boldsymbol \beta) \right] \, \underset {y=1} {\prod_{i=1}^n} G(\mathbf x_i \boldsymbol \beta)
- This can be conveniently rewritten to the likelihood function of the sample for the n observations
L(\boldsymbol \beta \ | \ \text{data}) \ = \ \prod_{i=1}^n \underbrace {\left[ G(\mathbf x_i \boldsymbol \beta) \right]^{y_i} \, \left[1-G(\mathbf x_i \boldsymbol \beta) \right]^{1-y_i}}_{= f(y_i | \mathbf x, \boldsymbol \beta), \, \text{ conditional density of } y_i}
(Note, if y_i=0 then 1-y_i=1 and if y_i=1 then 1-y_i=0)
Taking logs we get the log-likelihood function
\ln L(\boldsymbol \beta \ | \ \text{data}) \ = \ \sum_{i=1}^n \underbrace {\left\lbrace y_i \ln \left[ G(\mathbf x_i \boldsymbol \beta) \right] \, + \, (1-y_i) \ln \left[1-G(\mathbf x_i \boldsymbol \beta) \right] \right \rbrace}_{\ln f(y_i | \mathbf x, \boldsymbol \beta)}
- We maximize \ln L(\boldsymbol \beta \, | \, \text{data}) with respect to \boldsymbol \beta such that the observed sample has the highest probability. This leads to the likelihood equations (the counterparts to the normal equations of OLS)
\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n \left[ y_i \dfrac{g(\mathbf x_i \boldsymbol \beta)}{G(\mathbf x_i \boldsymbol \beta)} \ - \ (1-y_i) \dfrac{g(\mathbf x_i \boldsymbol \beta)}{1-G(\mathbf x_i \boldsymbol \beta)} \right] \mathbf x_i \ = \ \mathbf 0, \quad \text{ with } g := G'
- After some manipulations, for the Logit case, these likelihood equations reduce to
\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \mathbf x_i \ = \ \mathbf 0 \
Noting that [y_i - \Lambda(\mathbf x_i \boldsymbol \beta)] are the residuals of the model, the similarities to the OLS normal equations are obvious (compare Equation 1.9)
- If \boldsymbol \beta contains an intercept, the first likelihood equation is
\dfrac { \partial \ln L}{\partial \beta_0} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \ = \ 0 \ \ \ \Rightarrow \ \ \ \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n \Lambda(\mathbf x_i \boldsymbol \beta)
- Hence, the proportion of ones in the sample equals the average of the predicted probabilities
Interestingly, the likelihood equations for the Probit case don’t look so familiar, involving \phi/\Phi, the inverse Mills ratio
In both cases, the likelihood equations are highly nonlinear and have to be solved numerically, for instance with the Newton-Raphson algorithm
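As an illustration of this machinery, a minimal sketch that maximizes the Logit log-likelihood numerically with optim() (a quasi-Newton method rather than Newton-Raphson), using the mroz data and the specification of the example in the next subsection; the coefficients coincide with those from glm(), and the inverse Hessian of the negative log-likelihood yields the standard errors (see Section 10.2.2).
Code
# Logit MLE "by hand" with optim(), compared to glm()
library(wooldridge)
data("mroz")
X <- model.matrix(~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                  data = mroz)
y <- mroz$inlf
# negative log-likelihood: -sum{ y*log(Lambda(xb)) + (1-y)*log(1-Lambda(xb)) }
negloglik_logit <- function(b, y, X) {
  p <- plogis(X %*% b)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
opt <- optim(rep(0, ncol(X)), negloglik_logit, y = y, X = X,
             method = "BFGS", hessian = TRUE)
cbind(optim = opt$par,
      glm   = coef(glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
                         kidslt6 + kidsge6, data = mroz,
                       family = binomial(link = "logit"))))
sqrt(diag(solve(opt$hessian)))   # standard errors from the inverse (negative) Hessian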
10.1.7 Example
We once more examine a model for the probability of labor force participation of married women, inlf (for "in labor force"). The explanatory variables are: nwifeinc (additional family income), educ, exper, exper^2, age, kidslt6 (number of children under 6 years) and kidsge6 (number of children over 6 years)
This time, we additionally estimate Logit and Probit models. Furthermore, we use the R package marginaleffects to calculate the implied APE
- The results presented below are pretty similar for the three models
- Note, you should only directly compare the APEs
- As shown previously, the partial effects are nonlinear transformations of the Logit and Probit coefficients
- The Logit coefficients are typically 1.6 to 1.8 times larger than the Probit coefficients
- Probit coefficients are typically about 3.3 times larger than the ones of the LPM
- Fortunately, the R package marginaleffects does the work for us and calculates the appropriate APE
- All estimated coefficients have the expected sign
- Once more, of special interest is the variable kidslt6, the number of children under 6 years. The estimated APE is about -0.26, meaning that, on average, an additional young child reduces the probability of being in the labor force by 26 percentage points
- But for Logit and Probit models, this partial effect depends in a nonlinear manner on the values of the explanatory variables. This is a main advantage compared to the LPM. The following figures show these nonlinear effects of the number of children under 6 on labor force participation
library(wooldridge)
library(marginaleffects)
library(modelsummary)
data("mroz")
# LPM model
outlpm <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
#Logit model
outlogit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz,
family = binomial(link = "logit"))
# Probit model
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz,
family = binomial(link = "probit"))
# Average partial effects with avg_slopes() from marginaleffects package
marg_lpm <- avg_slopes(outlpm)
marg_logit <- avg_slopes(outlogit)
marg_probit <- avg_slopes(outprobit)
summary(outlogit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.425452 0.860365 0.495 0.62095
nwifeinc -0.021345 0.008421 -2.535 0.01126 *
educ 0.221170 0.043439 5.091 3.55e-07 ***
exper 0.205870 0.032057 6.422 1.34e-10 ***
I(exper^2) -0.003154 0.001016 -3.104 0.00191 **
age -0.088024 0.014573 -6.040 1.54e-09 ***
kidslt6 -1.443354 0.203583 -7.090 1.34e-12 ***
kidsge6 0.060112 0.074789 0.804 0.42154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.75 on 752 degrees of freedom
Residual deviance: 803.53 on 745 degrees of freedom
AIC: 819.53
Number of Fisher Scoring iterations: 4
summary(outprobit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2700736 0.5080782 0.532 0.59503
nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
educ 0.1309040 0.0253987 5.154 2.55e-07 ***
exper 0.1233472 0.0187587 6.575 4.85e-11 ***
I(exper^2) -0.0018871 0.0005999 -3.145 0.00166 **
age -0.0528524 0.0084624 -6.246 4.22e-10 ***
kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
kidsge6 0.0360056 0.0440303 0.818 0.41350
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.7 on 752 degrees of freedom
Residual deviance: 802.6 on 745 degrees of freedom
AIC: 818.6
Number of Fisher Scoring iterations: 4
Code
# Modifications for modelsummary to print hit rates for glm() objects
library(broom)
# goodness-of-fit rows are added one below the other
glance_custom.glm <- function(x, ...) {
HRate <- round( hitrate(x, stragaz=FALSE)$HitRate, 2 )
Sens <- round( hitrate(x, stragaz=FALSE)$Sensitivity, 2 )
Spec <- round( hitrate(x, stragaz=FALSE)$Specificity, 2 )
Ratio <- round( hitrate(x, stragaz=FALSE)$Nof1s / hitrate(x, stragaz=FALSE)$Nof0s, 2 )
out <- data.frame("HitRate" = HRate,
                  "Sensitivity" = Sens,
                  "Specificity" = Spec,
                  "Ratio 1/0" = Ratio)
return(out)
}
Code
modelsummary( list(" LPM "=outlpm, " LPM APE "=marg_lpm, " Logit "=outlogit,
" Logit APE "=marg_logit, " Probit "=outprobit,
" Probit APE "=marg_probit),
output="flextable",
gof_omit = "F",
align = "ldddddd",
stars = TRUE,
fmt = 4)
===========================================================================================
                 LPM         LPM APE     Logit       Logit APE   Probit      Probit APE
-------------------------------------------------------------------------------------------
(Intercept)      0.5855***               0.4255                  0.2701
                 (0.1542)                (0.8604)                (0.5081)
nwifeinc        -0.0034*    -0.0034*    -0.0213*    -0.0038*    -0.0120*    -0.0036*
                 (0.0014)    (0.0014)    (0.0084)    (0.0015)    (0.0049)    (0.0015)
educ             0.0380***   0.0380***   0.2212***   0.0395***   0.1309***   0.0394***
                 (0.0074)    (0.0074)    (0.0434)    (0.0073)    (0.0254)    (0.0073)
exper            0.0395***   0.0268***   0.2059***   0.0254***   0.1233***   0.0256***
                 (0.0057)    (0.0025)    (0.0321)    (0.0022)    (0.0188)    (0.0022)
I(exper^2)      -0.0006**               -0.0032**               -0.0019**
                 (0.0002)                (0.0010)                (0.0006)
age             -0.0161***  -0.0161***  -0.0880***  -0.0157***  -0.0529***  -0.0159***
                 (0.0025)    (0.0025)    (0.0146)    (0.0024)    (0.0085)    (0.0024)
kidslt6         -0.2618***  -0.2618***  -1.4434***  -0.2578***  -0.8683***  -0.2612***
                 (0.0335)    (0.0335)    (0.2036)    (0.0319)    (0.1184)    (0.0319)
kidsge6          0.0130      0.0130      0.0601      0.0107      0.0360      0.0108
                 (0.0132)    (0.0132)    (0.0748)    (0.0133)    (0.0440)    (0.0132)
-------------------------------------------------------------------------------------------
Num.Obs.         753         753         753         753         753         753
R2               0.264       0.264
AIC              865.8       865.8       819.5       819.5       818.6       818.6
BIC              907.4       907.4       856.5       856.5       855.6       855.6
Log.Lik.        -423.892    -423.892    -401.765    -401.765    -401.302    -401.302
RMSE             0.42        0.42        0.42        0.42        0.42        0.42
HitRate          0.73        0.73        0.74        0.74        0.73        0.73
Sensitivity      0.82        0.82        0.81        0.81        0.81        0.81
Specificity      0.62        0.62        0.64        0.64        0.63        0.63
Ratio.1.0        1.32        1.32        1.32        1.32        1.32        1.32
-------------------------------------------------------------------------------------------
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
===========================================================================================
(Figures: predicted probability of labor force participation as a function of kidslt6 for the LPM, Logit, and Probit models; plotting code not shown.)
10.2 Maximum likelihood estimation in general
The starting point for obtaining maximum likelihood estimators (MLE) is the assumption that the conditional distribution of the data – the data generating process – is known, except for a finite number of parameters \boldsymbol \theta which have to be estimated
- So, we have the probability density function, (if y_i is a continuous RV) respectively the probability mass function (if y_i is a discrete RV) for observation i
f(y_i \, | \, \mathbf x_i; \boldsymbol \theta), \ \ \forall i
- The joint density of n independent and (identically) distributed (i.i.d.) observations for this process is the product of the individual densities
f(y_1, \ldots, y_n \, | \, \mathbf X; \boldsymbol \theta) \ = \ \prod_{i=1}^n f(y_i \, | \, \mathbf x_i; \boldsymbol \theta) \ = \ L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X)
- The joint density also defines the likelihood function L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X), which is a function of the unknown parameters \boldsymbol \theta, given the sample data (\mathbf y, \mathbf X)
- Note the reversed order of the arguments. This should simply indicate that we are interested in finding the parameters given the data (not that \boldsymbol \theta is random)
- Remark: If the observations are not independent, as is very often the case with time series, the joint density is more complicated and involves the covariance matrix of the observations y_i. For more complicated time series models, the joint density can in many cases be obtained with the so-called prediction-error decomposition of the likelihood, which is calculated with the help of the Kalman filter
- As it is usually simpler to work with logs, the log-likelihood function is
\ln L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X) \ = \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)
- We apply the Maximum Likelihood principle: In estimating \boldsymbol \theta, we are looking for parameters \boldsymbol \theta so that the observed sample has the highest probability
\underset{\boldsymbol \theta}{\max} \ \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)
- The first order conditions of this optimization problem lead to the so called likelihood equations, which are the counterparts to the normal equations of OLS
\dfrac { \partial \ln L}{\partial \boldsymbol \theta} := \ \mathbf g(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \frac {\partial }{\partial \boldsymbol \theta} \, \ln f(y_i | \mathbf x_i, \boldsymbol \theta) \ = \ \mathbf 0
- The solution of the likelihood equations yields the estimated parameter vector \hat{\boldsymbol \theta}_{ML}
- This estimate is identified (estimable) if for any other \boldsymbol \theta, L(\hat{\boldsymbol \theta}_{ML}) \neq L(\boldsymbol \theta) (uniqueness)
- In most cases, the optimization has to rely on numerical optimization procedures like the Newton-Raphson algorithm
The second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum
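A minimal sketch of the general principle, using simulated exponential data (a distribution for which the likelihood equation can also be solved analytically): the numerical maximizer of the log-likelihood coincides with the closed-form MLE 1/\bar y, and the Hessian returned by optim() anticipates the covariance estimate of Section 10.2.2.
Code
# MLE for the rate of an exponential distribution: numerical vs. closed form
set.seed(42)
y <- rexp(500, rate = 2)                     # simulated data, true rate = 2
negloglik <- function(lambda) -sum(dexp(y, rate = lambda, log = TRUE))
opt <- optim(1, negloglik, method = "Brent", lower = 1e-6, upper = 100, hessian = TRUE)
c(numerical = opt$par, closed_form = 1 / mean(y))   # essentially identical
sqrt(1 / opt$hessian)                               # Hessian-based standard error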
10.2.1 Properties of MLE
If the distributional assumption holds, Maximum Likelihood estimators are generally
- Consistent:
\operatorname{plim} \hat{\boldsymbol \theta}_{ML} = \boldsymbol \theta_T, \quad \text{with } \boldsymbol \theta_T \text{ being the true parameter vector}
But in finite samples, MLE are not generally unbiased
- Asymptotically normal, with
\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace^{-1} \right)
- Thereby, \mathbf H(\boldsymbol \theta_T) is the Hessian matrix of the log-likelihood function, evaluated at the true parameter vector \boldsymbol \theta_T:
\operatorname {Asy.Var} (\hat {\boldsymbol \theta}_{ML}) \, = \, \ \left( - E \left[ \mathbf H(\boldsymbol \theta_T) \right] \right) ^{-1} \, = \, \left( -E \left[ \dfrac {\partial^2 \ln L}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right] \right)^{-1} \, = \,
\left( -E \, \left[ \begin{array}{ccc} \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_k} \\ \vdots & \ddots & \vdots \\ \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_k} \end{array} \right] \right)^{-1}
- Asymptotically efficient in the class of consistent and asymptotically normal estimators
- The Cramer-Rao lower bound is the asymptotic lower bound for the variance of a consistent and asymptotically normal estimator. It can be shown to be equal to the inverse of the Information matrix
- The Information matrix \mathbf I_n(\boldsymbol \theta_T) for \boldsymbol \theta_T is defined as the covariance matrix of the gradient \mathbf g, the score, of the log-likelihood function evaluated at the true parameter vector. \mathbf I_n(\boldsymbol \theta_T) measures the amount of information about \boldsymbol \theta_T contained (on average) in a realization of the data
\mathbf I_n (\boldsymbol \theta_T) := \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = E \left( \mathbf g(\boldsymbol \theta_T) \, \mathbf g(\boldsymbol \theta_T)' \right),
\text{with } \ \mathbf g(\boldsymbol \theta) := \dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y), \ \ E[\mathbf g(\boldsymbol \theta_T)] = \mathbf 0
- Under fairly general conditions, the Information Matrix is equal to the negative expected value of the Hessian Matrix, evaluated at the true parameters. This is the Information matrix equality
\mathbf I_n (\boldsymbol \theta_T) = \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = -E[\mathbf H(\boldsymbol \theta_T)]
- The right hand term equals the inverse of the covariance matrix of \hat {\boldsymbol \theta}_{ML}. Hence, the variance of the MLE equals the Cramer-Rao lower bound and therefore, the MLE is asymptotically efficient
- Finite sample property: If the sample is drawn from an exponential family distribution and there exists an unbiased minimum variance estimator, then it is the MLE
10.2.2 Covariance matrix
- Sketch of a proof that the MLE is asymptotically normal. We make a first order approximation of the first order conditions and assume that the y_i are independently distributed (to ease the notation we write \hat{\boldsymbol \theta} instead of \hat{\boldsymbol \theta}_{ML})
\mathbf g (\hat {\boldsymbol \theta}) = \mathbf 0, \ \ \Rightarrow \ \ \mathbf g ( \hat {\boldsymbol \theta}) \approx \mathbf g (\boldsymbol \theta_T) + \underbrace{\mathbf H(\boldsymbol \theta_T)}_{\frac {\partial }{\partial \boldsymbol \theta} \mathbf g(\boldsymbol \theta_T)} (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \mathbf 0 \ \ \Rightarrow
\mathbf g (\boldsymbol \theta_T) = -\mathbf H(\boldsymbol \theta_T)(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \ \Rightarrow
(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [-\mathbf H(\boldsymbol \theta_T)]^{-1} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow
(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1} \frac{1}{n} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow
\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \underbrace {[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1}}_{\stackrel{p}{\longrightarrow} \ \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1}} \ \ \ \ \cdot \underbrace{\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T)}_{\stackrel{d}{\longrightarrow} \ N(\mathbf 0,\, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)])} \ \ \Rightarrow
\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \stackrel{d}{\longrightarrow} \ N \left( \mathbf 0, \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)] \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \right) \ =
N (\mathbf 0, \, \lbrace E[-\frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1})
- For the last two steps we invoke the independence assumption and the Information matrix equality. In this case
\mathbf g (\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta), \ \ \text{ with } \mathbf g_i(\boldsymbol \theta) := \dfrac {\partial \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta}, \ \ \text{ and } \ \ \mathbf {\bar g} (\boldsymbol \theta) = \frac {1}{n} \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta)
\mathbf H(\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf H_i (\boldsymbol \theta), \ \ \text{ with } \mathbf H_i(\boldsymbol \theta) := \dfrac {\partial^2 \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta \, \partial \boldsymbol \theta'}
\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T) is \sqrt n times the average of random variables \mathbf g_i with an expected value of 0 and covariance matrix \operatorname {Var}\left( \mathbf g_i (\boldsymbol \theta_T) \right) = -E[\mathbf H_i (\boldsymbol \theta_T)] (by the Information matrix equality) and an average covariance matrix of -E[\frac{1}{n} \sum_i \mathbf H_i (\boldsymbol \theta_T)] = -E[\frac {1}{n} \mathbf H (\boldsymbol \theta_T) ]. Hence, we can invoke a CLT
-\frac{1}{n} \mathbf H(\boldsymbol \theta_T) is an average too and, by LLN, converges in probability to its expected value -E[\frac {1}{n} \mathbf H(\boldsymbol \theta_T)]
From the limiting distribution of \sqrt n (\hat {\boldsymbol \theta}_{ML} - \boldsymbol \theta_T) we get the asymptotic distribution of \hat{\boldsymbol \theta}_{ML}
\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace ^{-1} \right)
- This basically implies consistency of \hat {\boldsymbol \theta}_{ML} too. The essential point was E[\mathbf g(\boldsymbol \theta_{T})] = \mathbf 0 in the third-to-last line of the proof
As the expected value in the covariance matrix formula is generally difficult to obtain, the covariance matrix of the estimated parameters is simply estimated by the inverse of the actual (not the expected) negative Hessian matrix at \hat{\boldsymbol \theta}_{ML}
\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ -\mathbf H(\hat {\boldsymbol \theta})^{-1} \ =
\left( -\dfrac {\partial^2 \ln L(\hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( - \dfrac { \sum_{i=1}^n \partial^2 \ln f(y_i | \mathbf x; \hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( -\sum_{i=1}^n \mathbf H_i(\hat {\boldsymbol \theta}) \right)^{-1}
with f(y_i | \mathbf x; \boldsymbol \theta) being the conditional distribution function of observation i
- Another often used estimator for the covariance matrix involves the outer product of the gradient (score) vectors. This is based on the fact shown above that the Information matrix is defined as the variance of the scores, which is the expected value of the outer product of the gradient vectors
\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ \left( \sum_{i=1}^n \mathbf g_i(\hat{\boldsymbol \theta}) \, \mathbf g_i(\hat{\boldsymbol \theta})' \right)^{-1}
- This is the so-called BHHH estimator and is comparatively easy to obtain. This estimate is always positive definite, but in small samples the estimator using the Hessian matrix generally performs better (a small computational sketch follows at the end of this subsection)
- Why does the variance of the estimated parameters depend on the second derivatives?
- The second derivative measures how concave the function is. If it is (in absolute terms) small at the optimum (nearly linear, with a slope of 0), the likelihood function is flat at the optimum. This means that there is a large interval of other parameters having nearly the same likelihood. Hence, the variance of the estimated \theta is high
- On the other hand, if the second derivative is large in absolute terms, the likelihood function is strongly curved at the optimum, meaning that there is only a small interval of parameters having nearly the same likelihood. Therefore, the estimate is precise and the variance of \hat \theta is low
- Hence, we have an inverse relationship between the size of the second derivatives of the likelihood function and the variance of the estimated parameters; this inverse relationship is reflected in the covariance matrix formula
- As we already know, the second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum
- For the Logit model we have \mathbf H = - \sum_i \Lambda(\mathbf x_i \boldsymbol \beta) (1-\Lambda(\mathbf x_i \boldsymbol \beta) )\,\mathbf x_i \mathbf x_i', which is clearly negative definite for all \boldsymbol \beta
- A similar result applies to the Probit model, which means that the likelihood functions of the two models are always concave, guaranteeing a global maximum
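A small computational sketch, assuming the Logit fit outlogit from the example in Section 10.1.7: the analytic Hessian of the Logit log-likelihood given above and the outer product of the scores provide two covariance matrix estimates; their square-rooted diagonals can be compared with the standard errors reported by glm().
Code
# Hessian-based vs. BHHH (outer product of gradients) standard errors for `outlogit`
X <- model.matrix(outlogit)
y <- outlogit$model[[1]]
p <- as.numeric(plogis(X %*% coef(outlogit)))   # Lambda(x_i'beta_hat)
negH <- t(X) %*% (X * (p * (1 - p)))            # -H = sum Lambda_i(1-Lambda_i) x_i x_i'
S    <- X * (y - p)                             # rows are the scores g_i = (y_i - Lambda_i) x_i
OPG  <- t(S) %*% S
cbind(Hessian = sqrt(diag(solve(negH))),
      BHHH    = sqrt(diag(solve(OPG))),
      glm     = sqrt(diag(vcov(outlogit))))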
10.2.3 Hypothesis testing
- Because we have an estimate of the covariance matrix of \hat {\boldsymbol \theta}, we can carry out all of the usual Wald tests of linear hypotheses (after linearization, nonlinear restrictions are also possible – the restriction matrix \mathbf R is replaced by the Jacobian matrix of the nonlinear restrictions)
- With this test procedure, only the unrestricted model has to be estimated. If the restriction is not compatible with the data, i.e., the estimate \mathbf R \hat{\boldsymbol \theta} is too far away from the hypothesized value, the restriction is rejected
- The big advantage of this test strategy is that, with a large enough sample, no assumption about the distribution is required (we do not need the likelihood function)
- In the context of MLE, we have an estimate of the log-likelihood of the observed sample. This can be exploited for a Likelihood Ratio (LR) test
- With this procedure, the restricted as well as the unrestricted model is estimated. We then look at the log-ratio of the likelihoods of the two models. If the difference of the log-likelihoods is too large, the restricted model is rejected. The test statistic is asymptotically \chi^2-distributed with m degrees of freedom, m being the number of restrictions (a worked example follows after the figure below)
LR \ = \ 2 \left( \ln L(\hat{\boldsymbol \theta}_{ur}) - \ln L(\hat{\boldsymbol \theta}_r) \right) \ \stackrel{a}{\sim} \ \chi^2(m)
- The third test concept, which is also derived from the Maximum Likelihood principle, is the Lagrange Multiplier (LM) test
- With this procedure, only the restricted model is estimated. To obtain the parameters of the restricted model, the log-likelihood function is maximized subject to m constraints \mathbf c(\boldsymbol \theta)=\mathbf 0, using the Lagrange function
\underset{\boldsymbol \theta}{\max}: \ \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) - \boldsymbol \lambda' \mathbf c(\boldsymbol \theta)
- The first order condition shows that a restriction is only binding (has an influence on the estimated parameters) if the corresponding element of the (m \, \text{x} \, 1) Lagrange multiplier vector \boldsymbol \lambda is different from zero
\underbrace {\dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X)}_{\mathbf g(\boldsymbol \theta)} - \boldsymbol \lambda' \dfrac {\partial }{\partial \boldsymbol \theta} \mathbf c(\boldsymbol \theta) = \mathbf 0
- If the restriction is of the form \theta_j = c_j, with c_j some constants, the corresponding derivative of the constraint is one, \dfrac {\partial }{\partial \theta_j} \mathbf c(\boldsymbol \theta) = 1 and then, from the first order conditions, we have
\dfrac {\partial }{\partial \theta_j} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) = \lambda_{\tilde j}
Thus, the vector \boldsymbol \lambda, which is often interpreted as shadow prices for the restrictions \theta_j = c_j, is equal to the corresponding derivatives of the log-likelihood function. Therefore, \lambda_{\tilde j} measures the extent to which the first order conditions of the unconstrained model are violated (without the restriction, the corresponding derivative should differ from \mathbf 0 only because of sampling errors)
Hence, one idea is to test whether \boldsymbol \lambda is zero. A similar and more common test is directly based on the gradient vectors \mathbf g(\boldsymbol \theta_r), the score vectors, instead of \boldsymbol \lambda; hence, the label score test. This as well measures to what extent the gradient differs from \mathbf 0, the expected value of \mathbf g(\boldsymbol \theta) in the unconstrained optimum
It turns out that \mathbf g(\boldsymbol \theta) is asymptotically normally distributed and we have a “Wald-like” LM test statistic
LM \ = \ \mathbf g(\boldsymbol \theta_r)' \, \mathbf I(\boldsymbol \theta_r)^{-1} \mathbf g(\boldsymbol \theta_r) \ \ \stackrel{a}{\sim} \ \ \chi^2(m)
with \mathbf I(\boldsymbol \theta_r) the Information matrix, evaluated at the restricted parameter vector
- An asymptotically equivalent procedure to the score test is to first run a restricted regression and then, in a second step, an auxiliary regression using the residuals of the first step to test for a relaxation of the restrictions
- Typically, the test statistic for this variant of the LM test is \, n \cdot R^2 of the auxiliary regression, which is \chi^2(m) distributed
- This procedure is especially suited for testing misspecifications, as in this case the restricted variant is typically easier to estimate than the unrestricted model. Examples are tests for
- Heteroskedasticity (Breusch-Pagan test)
- Serial correlation (Breusch-Godfrey test)
- General misspecification (RESET)
- Endogenous regressors (Hausman-Wu test)
- Tests for exclusion of variables or linear restrictions among variables are typically easier to carry out with a Wald test
- All three test procedures introduced – Wald, LR and LM tests – are only suitable for nested models and are asymptotically equivalent
- The small sample properties of LR and LM tests are generally unknown, and the LM test is said to exhibit considerable small-sample distortions. On the other hand, the LM test is known to have larger power against small deviations from the H_0
Three test procedures
Code
library(magick)
img = image_read("topic3_tests.jpg")
image_scale(img, "600")
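As a worked example of the LR test, a minimal sketch reusing the Logit fits from the labor force example in Section 10.1.7: we test the joint exclusion of the two children variables by comparing the unrestricted fit outlogit with a restricted fit.
Code
# LR test for H0: the coefficients of kidslt6 and kidsge6 are both zero
outlogit_r <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age,
                  data = mroz, family = binomial(link = "logit"))
LR <- 2 * (as.numeric(logLik(outlogit)) - as.numeric(logLik(outlogit_r)))
LR
pchisq(LR, df = 2, lower.tail = FALSE)   # p-value with m = 2 restrictions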
10.2.4 Measures of fit
We already dealt with the hit rate in the context of binary choice models, with the refinements of sensitivity and specificity
Pseudo R-squared: Compare the maximized log-likelihood of the model with that of a model that only contains a constant (and no explanatory variables); McFadden’s \tilde{R}^2
\tilde{R}^{2}=1- \dfrac {\ln L_{u r}} {\ln L_{0}}
- Correlation based measures: Look at the correlation (or squared correlation) between predicted values or predicted probabilities and observed values
\operatorname {Corr} (y_i, \hat y_i)
- In the context of models with continuous dependent variables, this corresponds to the usual R^2
- Information criteria:
- Akaike information criterion, AIC = -2 \ln L + 2k
- Bayesian (Schwarz) information criterion, BIC = -2 \ln L + k \ln n
- Both are measures of the information loss due to an imperfect model. They deal with the trade-off between the goodness of fit and the simplicity of the model, thereby considering both the risk of overfitting and underfitting
- They are based on the log-likelihood of the model but place a penalty on the number of estimated parameters, like the adjusted R-squared
- These criteria can also be used to judge non-nested models (as long as y is the same): the lower, the better – this strategy is said to lead to better out-of-sample predictions (a short computation sketch follows below)
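A short computation sketch, assuming the Logit fit outlogit from the labor force example: McFadden's pseudo R-squared, the squared-correlation measure and the information criteria can all be obtained directly from the fitted glm object.
Code
# Measures of fit for the Logit model `outlogit`
ll_ur <- as.numeric(logLik(outlogit))                 # maximized log-likelihood
ll_0  <- as.numeric(logLik(glm(inlf ~ 1, data = mroz,
                               family = binomial(link = "logit"))))  # constant-only model
1 - ll_ur / ll_0                       # McFadden's pseudo R-squared
cor(mroz$inlf, fitted(outlogit))^2     # squared correlation of actual and predicted values
AIC(outlogit); BIC(outlogit)           # information criteria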
10.2.5 Digression: Introduction to Bayesian Statistics
For an introduction to Bayesian statistics watch this video:
https://www.youtube.com/watch?v=Pahyv9i_X2k
As the video above shows, the basic idea of Bayesian statistics is to update prior information about a parameter vector \boldsymbol \theta with observed data. Bayes Theorem implies that the posterior distribution of the parameter vector \boldsymbol \theta, f(\boldsymbol \theta|data), is proportional to the likelihood function f(data|\boldsymbol \theta) times the prior distribution f(\boldsymbol \theta). An important economic application is estimation of DSGE models.
However, the posterior distribution is often very complicated and difficult or impossible to integrate. For instance, we need to integrate to get the expectation of the posterior E(\boldsymbol \theta|data) = \int_\theta \boldsymbol \theta f(\boldsymbol \theta|data) d\boldsymbol \theta.
A solution of this problem are sampling algorithms, which draw samples of \boldsymbol \theta from an arbitrary proposal distribution in a way so that we finally get draws that are asymptotically distributed like the posterior distribution. If we have such a large sample of \boldsymbol \theta from the posterior it is very easy to get the expectation of \boldsymbol \theta (its simply the mean) or credible intervals (the Bayesian counterparts to confidence intervals).
The Markov Chain Monte Carlo (MCMC), and in particular the Metropolis-Hastings (MH) algorithms are such powerful sampling algorithms. By the way, these algorithms are also very important in many other fields like climate modeling, physics or big data econometrics. To learn more watch the following video:
https://www.youtube.com/watch?v=U561HGMWjcw
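To make the sampling idea concrete, the following is a minimal random-walk Metropolis-Hastings sketch with made-up data: we draw from the posterior of a normal mean \mu (known variance 1, N(0, 10^2) prior) and approximate the posterior mean and a credible interval from the draws.
Code
# Random-walk Metropolis-Hastings for the posterior of a normal mean
set.seed(123)
y <- rnorm(100, mean = 1.5, sd = 1)                # made-up data, true mean 1.5
log_post <- function(mu) {                         # log posterior up to a constant:
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) +   #   log-likelihood
    dnorm(mu, mean = 0, sd = 10, log = TRUE)       # + log prior
}
draws <- numeric(10000)
mu <- 0                                            # starting value
for (s in seq_along(draws)) {
  prop <- mu + rnorm(1, sd = 0.3)                  # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(mu)) mu <- prop   # accept/reject
  draws[s] <- mu
}
mean(draws[-(1:1000)])                             # posterior mean after burn-in
quantile(draws[-(1:1000)], c(0.025, 0.975))        # 95% credible interval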
10.3 Count data
Count data arise when the dependent variable is a counting number per unit of time or space – e.g., the number of restaurant meals taken by a consumer during a week or the number of arrests of a person during a year. So, y_i \in \lbrace 0, 1, 2, \ldots \rbrace
If these numbers are large and widespread, the variable can be considered as continuous and hence, OLS would be appropriate
- If these numbers are small – quite rare events – then the continuous approximation is a poor one
- In this case, OLS can still yield consistent estimates and the predicted values are interpreted as conditional expected values of the dependent variable
- However, OLS does not account for the special structure of the data, leading to inefficiency and nonsense predictions, like -0.2 restaurant meals per week
One remedy to avoid negative predictions is to postulate
E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
and apply NLS. However, this would not be efficient as the error term is heteroscedastic
Therefore, maximum likelihood estimation is usually employed for this type of models. As count data are often Poisson distributed this leads to the Poisson regression model (or to the more general negative binomial regression model)
- The Poisson distribution is derived from a limit of the binomial distribution, which we already encountered with the Logit and Probit models. If the number of independent (Bernoulli) draws n goes to infinity and the success probability p goes to zero, such that n\cdot p \rightarrow \lambda, the binomial distribution converges to
P(Y = y_i) = \dfrac {\lambda^{y_i}}{y_i!} e^{-\lambda}
This is called the Poisson distribution, with y_i the number of counts
The parameter \lambda is equal to both the expectation and the variance of counts! This distribution is considerably right-skewed for a small \lambda, but for \lambda>30, it’s almost identical to a N(\lambda, \lambda) normal distribution
- As we are interested in the effects of the explanatory variables on y (the number of counts), we must look at the Poisson distribution conditional on \mathbf x. We only have to specify \lambda = E[y \, | \, \mathbf x], which we set to
\lambda = E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
as above – \lambda has to be positive!
Given a sample of n observations, we can construct the log-likelihood function, which is the joint log-probability that each y_i is observed
\ln L(\boldsymbol \beta) \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \ln \lambda_i - \lambda_i - \ln(y_i!) \right] \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \, {\mathbf x_i \boldsymbol \beta} - e^{\mathbf x_i \boldsymbol \beta} - \ln(y_i!) \right], \quad \text{with } \lambda_i = e^{\mathbf x_i \boldsymbol \beta}
Note, that the last term, \ln(y_i!), is independent of \boldsymbol \beta and could be omitted
The estimates \hat{\boldsymbol \beta} are obtained by numerical maximization of the log-likelihood function
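A minimal sketch of this numerical maximization (using the crime1 data of the example below, with a shortened set of regressors chosen only for brevity); the result agrees with glm(..., family = poisson).
Code
# Poisson log-likelihood maximized "by hand" with optim(), compared to glm()
library(wooldridge)
data("crime1")
X <- model.matrix(~ pcnv + ptime86 + inc86 + black + hispan, data = crime1)
y <- crime1$narr86
negloglik_pois <- function(b, y, X) {
  lam <- exp(X %*% b)                        # lambda_i = exp(x_i'beta)
  -sum(y * (X %*% b) - lam - lgamma(y + 1))  # lgamma(y+1) = log(y!)
}
opt <- optim(rep(0, ncol(X)), negloglik_pois, y = y, X = X, method = "BFGS")
cbind(optim = opt$par,
      glm   = coef(glm(narr86 ~ pcnv + ptime86 + inc86 + black + hispan,
                       data = crime1, family = poisson)))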
- As we have
E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
the partial effects are
\dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} = {\underbrace {e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}_{E[y \, | \, \mathbf x]}} \, \beta_j \ \ \Rightarrow \tag{10.2}
\dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} \dfrac {1}{E[y \, | \, \mathbf x]} \ = \ \dfrac {\partial \ln E[y \, | \, \mathbf x]}{\partial x_j}{} \ = \ \beta_j \tag{10.3}
This is a semi-elasticity: a change in x_j of 0.01 leads approximately to a \beta_j percent change in E[y \, | \, \mathbf x]. For larger variations in x_j (e.g., for a dummy variable), use the exact formula (e^{\beta_j \Delta x_j}-1)\cdot 100 = \%\Delta E[y \, | \, \mathbf x] (see the small calculation below)
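As a small numerical illustration of the exact formula (using the coefficient of black from the example below, \hat\beta_j \approx 0.661 and \Delta x_j = 1):
Code
# Exact percentage effect of a dummy variable in the Poisson model
beta_black <- 0.661              # coefficient taken from the example below
(exp(beta_black) - 1) * 100      # about 93.7% higher expected number of arrests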
The Poisson regression model assumes that \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) which is very restrictive and often violated in applied work
If \operatorname{Var}(y|\mathbf x) > E(y|\mathbf x), we have overdispersion (which is quite common), in the reverse case underdispersion (which is not so common)
Fortunately, MLE of the Poisson regression model are consistent and asymptotically normal even if the Poisson distribution does not hold. This is then called Quasi-Maximum Likelihood estimation (QML). (This is similar to the case of OLS when the disturbance is not normally distributed)
- If the distributional assumption does not hold and QML is used, the usual standard errors are wrong. One then has to compute adjusted standard errors (a computational sketch is given at the end of the example below)
The residuals of the Poisson model are as usual \hat u_i = y_i-\hat y_i. To judge whether there is overdispersion the residuals are scaled (divided) by \sqrt{\hat y_i}. This is a proper heteroscedasticity correction. Additionally, \hat y is a predictor for E(y) and therefore also for \operatorname{Var}(y|\mathbf x). Hence, the variance of these scaled residuals should be one if the condition \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) is true
If the calculated standard error of these transformed residuals \hat \sigma is larger than one we have overdispersion. In this case the estimated standard errors of the coefficients should be multiplied by \hat \sigma
10.3.1 Example
As an example we analyze the crime1 data set of the wooldridge package. The dependent variable narr86 is the number of times a man was arrested during 1986. This variable is zero for 1970 of the 2725 men in the sample, and only eight values of narr86 are larger than five. So, the Poisson regression model is more appropriate than OLS
- Some interesting explanatory variables are
- pcnv: proportion of prior convictions
- tottime: time in prison since age 18, in months
- ptime86: months in prison during 1986
- inc86: legal income 1986, in $100s
- black: =1 if black
- hispan: =1 if Hispanic
- We estimate the model with OLS, as well as with the Poisson and Quasi-Poisson regression models; the latter two only differ by the correction of the standard errors because of overdispersion
- Additionally, we calculate the APE, which are comparable to the OLS estimates
- The OLS estimate for pcnv is -0.132, which means that with \Delta pcnv = 0.1, the expected number of arrests during 1986 decreases by 0.013
- The corresponding coefficient of the Poisson model is -0.402. But this is now a semi-elasticity: with \Delta pcnv = 0.1, the expected number of arrests during 1986 decreases by about 4.02%. Calculating the APE (which averages Equation 10.2) with the package marginaleffects, we get the absolute effect of -0.162, which is now comparable to the OLS estimate
- The effect of black in the OLS model is 0.327, meaning that for black men the expected number of arrests is 0.327 higher than for white men
- The Poisson coefficient for black is 0.661. In this case (a large \Delta x_j), we should use the exact formula, e^{\beta_j \Delta x_j}-1, leading to exp(0.661)-1, which is about 94% higher than for white men. Calculating the APE (which averages Equation 10.2), we get the effect in absolute terms, 0.328, which is about the same as the estimated OLS effect
- The SE of the transformed residuals of the Poisson model is 1.232, showing moderate overdispersion. The column QPoisson shows the corrected standard errors for the estimated \beta_j
- The estimated R^2, which is the squared correlation of actual versus predicted values of narr86, shows that the overall fit of the Poisson regression model is better than that of the linear regression model, although the OLS coefficients are chosen to maximize R^2. But OLS is restricted to a linear relationship
library(wooldridge)
data("crime1")
# OLS
outlm <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
inc86 + black + hispan + born60 ,
data = crime1)
# Poisson - family = poisson and the default link is "log",
# so the inverse default link is "exp"
outpoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
inc86 + black + hispan + born60 ,
data = crime1, family = poisson)
# Quasi-Poisson - this only adjusts the standard errors
outquasipoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
inc86 + black + hispan + born60 ,
data = crime1, family = quasipoisson)
# Calculating APE with "marginaleffects"
marg_lm <- avg_slopes(outlm)
marg_pois <- avg_slopes(outpoisson)
marg_qpois <- avg_slopes(outquasipoisson)
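As a quick plausibility check (a sketch): for the Poisson model with a log link, averaging Equation 10.2 over the sample amounts to multiplying a coefficient by the mean of the fitted values; the result should be close to the avg_slopes() output (for dummy variables such as black, marginaleffects uses discrete changes, so small deviations are expected).
Code
# Manual check of the Poisson APE for pcnv: beta_j times the average fitted value
coef(outpoisson)["pcnv"] * mean(fitted(outpoisson))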
Code
# Reset for modelsummary for glm()
glance_custom.glm <- function(x, ...) {
}
modelsummary( list("OLS"=outlm, "Poisson"=outpoisson, "Poisson APE"=marg_pois,
"QPoisson"=outquasipoisson, "QPoisson APE"=marg_qpois),
output="flextable",
gof_omit = "F",
align = "lddddd",
stars = TRUE,
fmt = 3)
|  | OLS | Poisson | Poisson APE | QPoisson | QPoisson APE |
|---|---|---|---|---|---|
| (Intercept) | 0.577*** | -0.600*** |  | -0.600*** |  |
|  | (0.038) | (0.067) |  | (0.083) |  |
| pcnv | -0.132** | -0.402*** | -0.162*** | -0.402*** | -0.162*** |
|  | (0.040) | (0.085) | (0.035) | (0.105) | (0.043) |
| avgsen | -0.011 | -0.024 | -0.010 | -0.024 | -0.010 |
|  | (0.012) | (0.020) | (0.008) | (0.025) | (0.010) |
| tottime | 0.012 | 0.024+ | 0.010+ | 0.024 | 0.010 |
|  | (0.009) | (0.015) | (0.006) | (0.018) | (0.007) |
| ptime86 | -0.041*** | -0.099*** | -0.040*** | -0.099*** | -0.040*** |
|  | (0.009) | (0.021) | (0.008) | (0.025) | (0.010) |
| qemp86 | -0.051*** | -0.038 | -0.015 | -0.038 | -0.015 |
|  | (0.014) | (0.029) | (0.012) | (0.036) | (0.014) |
| inc86 | -0.001*** | -0.008*** | -0.003*** | -0.008*** | -0.003*** |
|  | (0.000) | (0.001) | (0.000) | (0.001) | (0.001) |
| black | 0.327*** | 0.661*** | 0.328*** | 0.661*** | 0.328*** |
|  | (0.045) | (0.074) | (0.045) | (0.091) | (0.055) |
| hispan | 0.194*** | 0.500*** | 0.235*** | 0.500*** | 0.235*** |
|  | (0.040) | (0.074) | (0.040) | (0.091) | (0.050) |
| born60 | -0.022 | -0.051 | -0.020 | -0.051 | -0.020 |
|  | (0.033) | (0.064) | (0.025) | (0.079) | (0.031) |
| Num.Obs. | 2725 | 2725 | 2725 | 2725 | 2725 |
| R2 | 0.072 |  |  |  |  |
| AIC | 6721.4 | 4517.5 | 4517.5 |  |  |
| BIC | 6786.4 | 4576.6 | 4576.6 |  |  |
| Log.Lik. | -3349.678 | -2248.761 | -2248.761 |  |  |
| RMSE | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
10.4 Censored and truncated data
In many economic contexts, decision problems are such that either a positive amount or a zero amount is chosen, e.g., the demand for cars or alcohol (corner solutions). In this case, we have a form of censored data and the distribution of car expenditures is a combination of a discrete distribution (at 0) and a continuous distribution
Another form of censored data arises when the dependent variable is censored from above or below a certain threshold, typically due to survey design; e.g., respondents of a survey may only be able to answer a question about their wealth with "more than 1 million". Typically, no optimization problem is modeled in this case
In the first example for censored data we have expenditure data for buyers and non-buyers (which are zero). However, if the data are collected from sales tax records, then the data would only include buyers
-
Data, in which observations are unavailable above or below a certain threshold are called truncated data. Note, for truncated data we neither observe y_i nor \mathbf x_i, in contrast to censored data. The sample is not random any more if the selection mechanism is related to the value of the dependent variable
-
Remark: Selection mechanisms which are exogenous, i.e., the dependent variable and its error term are not involved, pose no special estimation problems
-
Example:
Suppose we want to estimate the relationship between the saving rate and income, but we have oversampled people with high income (exogenous sampling mechanism, which is based on an exogenous (rhs) variable). If the same relationship between savings and income is true for all people, this would not lead to biased estimates, (although with higher variance – less variation in the explanatory variable)
However, if we have oversampled people with a high saving rate (endogenous sampling mechanism) this would lead to biased estimates even if the same relationship between savings and income is true for all people (sample selection problem)
-
A linear regression model may be inadequate for both censored and truncated data
Figure 10.4 below shows the effects of censored and truncated data for estimating the true (population) relationship. In both cases we see a significant bias towards zero in the estimates based on the observed data. Note, regarding the truncated sample, we have an endogenous sampling mechanism related to the value of y – thus there is a sample selection problem
Code
library("ggplot2")
library("dplyr")
library("tidyverse")
baw <- FALSE
set.seed(2020)
N <- 1000
X = rnorm(N, 3, 4)
Y = - 1 + 0.5 * X + rnorm(N, 0, 1.8)
z <- data.frame(X = X, Y = Y)
z <- mutate(z,
neg = factor(Y < 0, labels = c("yes", "no")),
Yc = ifelse(Y > 0 , Y, 0),
Yt = ifelse(Y > 0, Y, NA))
z2 <- gather(z, key, value, -X, - neg)
colols <- ifelse(baw, "black", "#AA4444")
ys <- c("Y" = "Whole sample", "Yc" = "Censored sample", "Yt" = "Truncated sample")
gp <- ggplot(z2, aes(X, value)) + geom_point(size = 0.6, aes(colour = neg)) +
geom_smooth(method = "lm", se = FALSE, lty = "twodash", color = colols) +
geom_abline(intercept = -1, slope = 0.5) +
facet_wrap(~ key, ncol = 3, labeller = labeller(key = ys)) +
guides(colour = FALSE) + xlab("") + ylab("") +
theme(legend.text = element_text(size = 6),
legend.title= element_text(size = 8),
axis.title = element_text(size = 8))
if (baw) gp <- gp + scale_color_grey()
gp
Some examples for typical count and censored data
library(wooldridge);
data("crime1")
# Number of arrest - typical count data
hist(crime1$narr86, main = "Typical count data, should look like a Poisson distribution")
library(wooldridge);
data("mroz")
# Number of hours worked - not everybody works
# Data apparently not Poisson distributed
hist(mroz$hours, main = "Typical censored data, (>= 0)")
library(wooldridge);
data("recid")
# Duration time until next arrest. We have different release times from prison,
# hence different thresholds as the endpoint for observations is fixed
hist(recid$durat, main = "Censored data from above with different but known thresholds")
10.4.1 Truncated Normal Regression Model
Let’s start with the analysis of truncated regression models. Characteristic for these is that the outcome y_i and the explanatory variables \mathbf x_i are only observed if the outcome y_i is less than (or greater than) or equal to some value c_i
In this case, the sample is not a random sample from the population (because some units will never be a part of the sample)
The truncated normal regression model is:
y_i = \mathbf x_i \boldsymbol \beta + u_i
u_i \, | \, \mathbf x_i \sim N(0, \sigma^2)
(y_i, \mathbf x_i) \ \text{ only observed if } \ y_i > c_i \ \ \ (\text{or } \ y_i < c_i) \tag{10.4}
Truncation from below or left: observe y_i and \mathbf x_i if y_i > c_i
Truncation from above or right: observe y_i and \mathbf x_i if y_i < c_i
In the following we examine truncation from below. Truncation from above is handled analogously and does not require any new results, see Green (2017)
Truncated normal distribution
Theorem 10.1 (Truncated normal distribution, see Green (2017)) Let y be a normally distributed random variable with density f(y), mean E(y)=\mu and variance \operatorname {Var}(y)=\sigma^2 and a cutoff constant c. Then the conditional density function f(y|y>c) is
f(y) \ \equiv \ f(y|y>c)\cdot P(y>c) \ \ \Rightarrow
f(y|y>c) = \frac{f(y)}{P(y>c)} = \frac{f(y)}{P(y-\mu>c-\mu)} =
\frac{f(y)}{P[(y-\mu)/\sigma > (c-\mu)/\sigma]} \, = \, \frac{f(y)}{1-F[(c-\mu)/\sigma]} \, =
\frac{\frac{1}{\sigma}\phi \left( \dfrac{y-\mu}{\sigma}\right)}{1-\Phi\left(\dfrac{c-\mu}{\sigma}\right)} \ \tag{10.5}
As already defined above, \phi(z) denotes the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution (see Equation A.3 and Equation A.4).
The mean and variance are given by:
\begin{align*}E(y|y>c) &= \mu + \sigma \, \lambda \left(\frac{\mu-c}{\sigma} \right) \\ \operatorname{Var}(y|y>c) &= \sigma^2 \left[ 1-\delta \left(\frac{\mu-c}{\sigma}\right) \right] \end{align*} \tag{10.6}
where \lambda(x) is the Inverse Mills Ratio (or hazard function of the standard normal distribution): 1
\lambda(x) := \dfrac {\phi(x)}{\Phi(x)}, \ \ \lambda(x) >0, \ \lambda'(x) <0, \ \forall x \tag{10.7}
\delta(x) := \lambda(x)[\lambda(x)-x], \ \ 0<\delta(x)<1
Remarks: E(y|y>c) > E(y) and \operatorname {Var}(y|y>c) < \operatorname {Var}(y).
The Inverse Mills Ratio \lambda(x) (IMR) is fundamental in models with incomplete data or self-selective samples. It can be interpreted as a measure of the probability that a particular observation is selected (is in the sample) compared to those that are not selected (not in the sample). A lower IMR indicates a higher probability of selection.
Code
x <- seq(-3, 3, length=100)
# plotting normal distribution
plot(x, dnorm(x), type="l", ylim=c(0, 1.125))
# plotting truncated normal distribution, truncated from below at -0.5
a=-0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0, x1=a, y1=o, lty = 3)
# plotting truncated normal distribution, truncated from below at 0.5
a=0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0, x1=a, y1=o, lty = 3)
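As a quick numerical check of Theorem 10.1 (a sketch with arbitrary values for \mu, \sigma and c), the analytical mean in Equation 10.6 should match the mean of a simulated truncated sample:
Code
# Check E(y | y > c) = mu + sigma * lambda((mu - c)/sigma), Equation 10.6
mu <- 2; sigma <- 3; cutoff <- 1
lambda <- function(x) dnorm(x) / pnorm(x)       # inverse Mills ratio, Equation 10.7
E_analytic <- mu + sigma * lambda((mu - cutoff) / sigma)
set.seed(123)
y <- rnorm(1e6, mean = mu, sd = sigma)
E_simulated <- mean(y[y > cutoff])
c(analytic = E_analytic, simulated = E_simulated)   # should be nearly identical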
Estimation
Applying OLS would not yield correct results because MLR.2 is violated (see Section 2.7). To see this, note that according to the theorem above, the expectation of y_i for the observed subpopulation y_i > c_i is:
E(y_i|y>c_i) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \tag{10.8}
-
Hence, regressing y_i only on \mathbf x_i leads to an omitted variable bias as the omitted variable \lambda(\mathbf x_i) (the inverse Mills Ratio, Equation 10.7) is a nonlinear function of \mathbf x_i and therefore probably correlated with \mathbf x_i
- In particular, for observations with a low a priori probability to be in the sample (because \mathbf x_i \boldsymbol \beta is low), the variable \lambda(\mathbf x_i) is high, thereby correcting for the selection bias in Equation 10.8
The partial effect of x_j on E(y_i|y>c_i), which we want to estimate, is:
\dfrac {\partial E(y_i|y>c_i)}{\partial x_j} \, = \, \beta_j + \sigma \, \lambda' \, \dfrac {\partial \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) }{\partial x_j} \, = \, \beta_j(1 + \underset {(-)} {\lambda'}) \tag{10.9}
- This can be shown to be less than \beta_j. Hence, we have a bias towards zero (compare Figure 10.4) 2
As just shown, OLS is biased and inconsistent for truncated data because of an omitted variable problem.
Thus, Maximum Likelihood estimation seems appropriate.
- According to Theorem 10.1, the density of an observed outcome y_i, conditional on the explanatory variables and the y > c_i is:
f\left(y_{i} \, | \, \mathbf{x}_{i}, y_i > c_{i}\right) \ = \ \frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \tag{10.10}
Maximizing the log-Likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2
\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{i=1}^n \ln \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} - \ln \left[ 1 - {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right]
\rightarrow \ \ \ \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2
Note that non-normality or heteroscedasticity in u_i leads to inconsistent estimates
Further note, the untruncated data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)
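To make the estimation step concrete, here is a minimal sketch of this log-likelihood maximized with optim(); the truncreg() function used in the example below does the same thing in a more robust way:
Code
# ML estimation of a normal regression truncated from below at `cutoff` (sketch)
trunc_loglik <- function(par, y, X, cutoff) {
  beta  <- par[-length(par)]
  sigma <- exp(par[length(par)])                 # parametrize log(sigma) to keep sigma > 0
  xb    <- drop(X %*% beta)
  # log f(y | y > cutoff) = log[(1/sigma) phi((y - xb)/sigma)] - log[1 - Phi((cutoff - xb)/sigma)]
  sum(dnorm(y, mean = xb, sd = sigma, log = TRUE) -
        pnorm((xb - cutoff) / sigma, log.p = TRUE))
}
# usage sketch, e.g. with the simulated data of the example below (truncation point 1):
# X <- cbind(1, x)
# fit <- optim(c(0, 0, 0), trunc_loglik, y = y[y > 1], X = X[y > 1, ], cutoff = 1,
#              control = list(fnscale = -1), hessian = TRUE)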
Example
Code
# We simulate data and store them in the data.frame d
set.seed(1234890)
n <- 10000
# true parameters, which we want to estimate
sigma <- 4
beta_0 <- 2
beta_1 <- 1;
x <- rnorm(n, mean = 0, sd = 2);
eps <- rnorm(n, sd = sigma)
# the true model
y <- beta_0 + beta_1 * x + eps
d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package
# generating truncated data y = yt, at c <= 1
d$yt <- ifelse(d$y > 1, d$y, NA)
# compare estimates for full/truncated data
# we use the function `truncreg` from the package `truncreg`
full <- lm(y ~ x, data = d)
ols <- lm(yt ~ x, data = d)
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")
summary(trunc)
Call:
truncreg(formula = yt ~ x, data = d, point = 1, direction = "left")
BFGS maximization method
45 iterations, 0h:0m:0s
g'(-H)^-1g = 7.38E-09
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 2.119760 0.170904 12.403 < 2.2e-16 ***
x 0.994784 0.046537 21.376 < 2.2e-16 ***
sigma 3.972827 0.079421 50.022 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -13547 on 3 Df
Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc),
output="flextable",
gof_map = c("nobs", "aic", "bic", "rmse"),
align = "lddd",
stars = TRUE,
fmt = 3)
|  | OLS full data | OLS trunc data | MLE trunc data |
|---|---|---|---|
| (Intercept) | 2.081*** | 4.707*** | 2.120*** |
|  | (0.040) | (0.037) | (0.171) |
| x | 1.035*** | 0.474*** | 0.995*** |
|  | (0.020) | (0.019) | (0.047) |
| sigma |  |  | 3.973*** |
|  |  |  | (0.079) |
| Num.Obs. | 10000 | 5967 | 5967 |
| AIC | 56035.8 | 29009.9 | 27099.7 |
| BIC | 56057.4 | 29030.0 | 27119.8 |
| RMSE | 3.99 | 2.75 | 3.69 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
- Compare Figure 10.4
10.4.2 Censored Normal Regression Model
Generally, censored data do not result from a truncation problem. In contrast to truncated data, the explanatory variables \mathbf x_i are observed even if the outcome y_i is below (above) some value c_i
-
Typically, a dependent variable is censored from above or below a certain threshold due to survey design; e.g., values are only reported up to a certain level (e.g., top coded wealth)
- Another example for censoring from below is limited measurement accuracy, for instance tumor marker values or the count of antibodies in medical data
With y_i^* being the true outcome and y_i being the observed outcome, the censored normal regression model is
y_i^* = \mathbf x_i \boldsymbol \beta + u_i
u_i \, | \, \mathbf x_i \sim N(0, \, \sigma^2)
y_i = \begin{cases} \max(y_i^*,c_i) \ \ \ \text{:left censoring (from below)} \\ \min(y_i^*,c_i) \ \ \ \ \text{:right censoring (from above)} \\ \max(y_i^*,\underline {c}_i) \ \land \ \min(y_i^*,\bar c_i) \ \ \ \text{:censoring from above and below} \end{cases} \tag{10.11}
- In the following we examine censoring from below. Censoring from above is handled analogously
Distribution of censored data
For censored data, the distribution function of y_i is a mixture of a discrete distribution, concentrated at the threshold c_i, and a truncated continuous density; the normal density in our case
-
Such a distribution is shown in Figure 10.6 above with threshold c_i=0
- Note, in this case no renormalization of the truncated continuous density is necessary, because the area below the truncated density function plus the probability of the discrete mass point is one
-
We have two parts of the distribution of y_i (all conditional on \mathbf x_i, which we leave out to save notation):
- For (left) censored observations, y_i = c_i, we have a discrete mass point with probability:
P(y_i = c_i) \,=\, P(y_i^* \leq c_i) \,=\, P(u_i \leq c_i - \mathbf x_i \boldsymbol \beta) \,=\, P \left( \frac {u_i}{\sigma} \leq \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \,=
\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.12}
- For the uncensored observations, y_i = y_i^* > c_i, the density of y_i, conditional on y_i>c_i, corresponds to a truncated density f\left(y_{i} \, | \, y_i > c_{i}\right), already derived in the section on truncated data, see Equation 10.5 and Equation 10.10
For the whole two-part distribution of y_i we have
f(y_i \,|\, \mathbf x_i, c_i) \ = \begin{cases} P(y_i=c_i) = \Phi \left( \dfrac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right); \ \ \text{ if } y_i=c_i \\ f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i); \ \ \text{ if } y_i>c_i \end{cases} \tag{10.13}
- The second part simplifies to
f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i) \ = \ \underbrace {\frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)}}_{ f(y_{i} \, | \, y_i > c_{i})} \underbrace {\left [1 - \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right ]}_{P(y_i>c_i)} \ =
\underbrace {\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)}_{f(y_i^*)=N(\mathbf x_i \boldsymbol \beta, \, \sigma^2)} \tag{10.14}
- And this is simply the non-truncated normal distribution of the uncensored observations (as mentioned above, there is no need for a renormalization)
Estimation
This leads to the joint density of the observed vector \mathbf y:
f(\mathbf y \, | \, \mathbf X,c_i) \ = \ \prod_{y_i=c_i} \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \cdot \prod_{y_i>c_i} \frac{1}{\sigma} \phi \left( \frac {y_i-\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.15}
- Maximizing the log-likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2
\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{y_i=c_i} \ln \left[ {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \ + \ \sum_{y_i>c_i}^n \ln \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \ \ \rightarrow \ \ \hat {\boldsymbol \beta}, \hat \sigma^2
Estimation has to be done by numerical optimization methods
After estimation, parameter test and exclusion restrictions can be done as usual with Wald, Likelihood Ratio or Lagrange Multiplier tests
Note, if the distributional assumptions on u_i (normality and homoscedasticity) are violated, the MLE above is inconsistent
Further note, the uncensored data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)
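A minimal sketch of this censored log-likelihood, again maximized with optim(); the censReg() and tobit() routines used below implement this for you:
Code
# ML estimation of a normal regression left-censored at `cutoff` (sketch)
cens_loglik <- function(par, y, X, cutoff) {
  beta  <- par[-length(par)]
  sigma <- exp(par[length(par)])                 # keep sigma > 0
  xb    <- drop(X %*% beta)
  cens  <- (y <= cutoff)                         # censored observations (observed y = cutoff)
  sum(pnorm((cutoff - xb[cens]) / sigma, log.p = TRUE)) +            # discrete part
    sum(dnorm(y[!cens], mean = xb[!cens], sd = sigma, log = TRUE))   # continuous part
}
# usage sketch: optim(c(0, 0, 0), cens_loglik, y = yc, X = cbind(1, x), cutoff = 1,
#                     control = list(fnscale = -1), hessian = TRUE)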
OLS applied to the observed data doesn’t yield consistent estimates for \boldsymbol \beta as the expectation of y_i, conditional on \mathbf x_i and y_i\geq c_i, is not \mathbf x_i \boldsymbol \beta but a highly nonlinear function in \mathbf x_i and \sigma
E(y_i) \ = \ E(y_i|y \geq c_i) \ = \ E(y_i|y_i > c_i) \cdot P(y_i > c_i) \, + \, \underbrace {E(y_i|y_i = c_i)}_{c_i} \cdot P(y_i = c_i)
Plugging in our known results regarding the truncation case E(y_i|y_i > c_i) (Equation 10.8) and for P(y_i = c_i) (Equation 10.12), respectively P(y_i > c_i), we have
E(y_i) \ = \ \underbrace {\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right]}_{E(y_i|y_i > c_i)} \underbrace {\left[ 1 -\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right]}_{P(y_i>c_i) \, = \, [1-P(y_i=c_i)]} \, + \, c_i \, \underbrace {\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right)}_{P(y_i=c_i)} \ =
\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right] \underbrace {\left[ \Phi \left( \frac {\mathbf x_i \boldsymbol \beta - c_i}{\sigma} \right) \right]}_{P(y_i>c_i)} \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \ =
\mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.16}
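A quick numerical check of Equation 10.16 (a sketch with arbitrary values): the analytical expectation should match the mean of a simulated left-censored variable.
Code
# Check Equation 10.16: E(y) for left-censoring at `cutoff`, with y* ~ N(xb, sigma^2)
xb <- 2; sigma <- 4; cutoff <- 1
E_analytic <- xb * pnorm((xb - cutoff) / sigma) +
              sigma * dnorm((xb - cutoff) / sigma) +
              cutoff * pnorm((cutoff - xb) / sigma)
set.seed(1)
ystar <- rnorm(1e6, mean = xb, sd = sigma)
E_simulated <- mean(pmax(ystar, cutoff))        # observed y = max(y*, cutoff)
c(analytic = E_analytic, simulated = E_simulated)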
Example continued
Code
# simulated data - example from above continued
set.seed(1234890)
n <- 1500 # fewer observations to better judge efficiency
# true parameters, which we want to estimate
sigma <- 4
beta_0 <- 2
beta_1 <- 1;
x <- rnorm(n, mean = 0, sd = 2);
eps <- rnorm(n, sd = sigma)
# the true model
y <- beta_0 + beta_1 * x + eps
d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package
library(censReg) # Censored Regression package
## truncated y = yt, censored y = yc, at c = 1
d$yt <- ifelse(d$y > 1, d$y, NA)
d$yc <- ifelse(d$y > 1, d$y, 1)
## compare estimates for full/truncated/censored data
# ols full data
full <- lm(y ~ x, data = d)
# ols truncated data
ols <- lm(yt ~ x, data = d)
# ols censored data
olsc <- lm(yc ~ x, data = d)
# MLE truncated data
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")
# MLE censored data
censored <- censReg(yc ~ x, data = d, left = 1)
summary(censored)
Call:
censReg(formula = yc ~ x, left = 1, data = d)
Observations:
Total Left-censored Uncensored Right-censored
1500 594 906 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 2.08976 0.11893 17.57 <2e-16 ***
x 0.90929 0.05969 15.23 <2e-16 ***
logSigma 1.38363 0.02540 54.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 5 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -2975.228 on 3 Df
Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc,
"OLS cens data"=olsc, "MLE cens data"=censored),
output="flextable",
gof_map = c("nobs", "aic", "bic", "rmse"),
align = "lddddd",
stars = TRUE,
fmt = 3)
|  | OLS full data | OLS trunc data | MLE trunc data | OLS cens data | MLE cens data |
|---|---|---|---|---|---|
| (Intercept) | 2.039*** | 4.716*** | 2.305*** | 3.352*** | 2.090*** |
|  | (0.104) | (0.096) | (0.419) | (0.071) | (0.119) |
| x | 0.964*** | 0.397*** | 0.827*** | 0.526*** | 0.909*** |
|  | (0.053) | (0.050) | (0.116) | (0.036) | (0.060) |
| sigma |  |  | 3.920*** |  |  |
|  |  |  | (0.201) |  |  |
| logSigma |  |  |  |  | 1.384*** |
|  |  |  |  |  | (0.025) |
| Num.Obs. | 1500 | 906 | 906 | 1500 | 1500 |
| AIC | 8439.7 | 4391.6 | 4117.3 | 7274.4 | 5956.5 |
| BIC | 8455.6 | 4406.0 | 4131.8 | 7290.3 | 5972.4 |
| RMSE | 4.02 | 2.72 | 3.55 | 2.73 |  |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
10.4.3 Tobit model
-
Strictly speaking, the Tobit model, Tobin (1958), is formally identical to a censored normal regression model. However, there are two distinct features, one minor and one important
In most cases, the Tobit model is formulated with left-censoring and c_i=0. This is not important and can easily be changed
More importantly, the Tobit model is based on an optimization model with corner solutions and not on censoring of data above or below a certain threshold due to certain survey designs
This different starting point has serious consequences for interpreting the estimation results, especially for calculating the relevant partial effects
Let us consider an example of an underlying optimization problem with corner solutions: agents i maximize their utility U(y_i, z_i) subject to a budget constraint p_y y_i + p_z z_i \leq I_i
y_i and z_i could be leisure or working time versus consumption of goods, or consumption of tobacco versus other goods, or consumption today versus consumption tomorrow, etc. The critical point is the additional non-negativity constraints; it is not possible to choose negative values, hence y_i ≥ 0 and z_i ≥ 0
Without the non-negativity constraints the solution of this optimization problem for y_i is (assuming a linear solution which is to be expected with certain utility functions)
y_i^* = f(p_y, p_z, I_i, u_i) = \mathbf x_i \boldsymbol \beta + u_i
- Here, y_i^* represents the potential choice, without non-negativity constraints, and u_i represents unobserved heterogeneity of the individual utility function. However, invoking the non-negativity constraints, we arrive at a standard censored regression model, with y_i being the observed values
y_i^* = \mathbf x_i \boldsymbol \beta + u_i \\ y_i = y_i^* \quad \text{if } \ y_i^* > 0 \\ y_i = 0 \quad \ \text{if } \ y_i^* ≤ 0
- The difference to the standard censored regression model is that in this case the non-negativity arises because the agent simply cannot choose negative values of a good, and not because the data are censored for some reason
This has consequences for the interpretation of the model. In the standard censored regression model we are interested in the population parameter \boldsymbol \beta, which, in the case of censored or truncated data, cannot be consistently estimated by OLS. Therefore, we need a maximum likelihood approach
In the Tobit model, with an underlying optimization problem with non-negativity constraints, we are typically interested in how much the value of the actually observed variable y (labor supply in hours or expenditures on cars or tobacco) changes due to changes in the explanatory variables.
Partial effects
In particular, there are actually three different partial effects present in the model
- The effect on the expectation of the latent variable – therefore on potential choices (without non-negativity restriction) and not on actual choices. Typically, this is not what we want in this case
\dfrac {\partial E(y^*|\mathbf x)}{\partial x_j} = \beta_j
- The effect on expected actual choices y, but only for uncensored observations, e.g., the choice of already employed people on their hours worked if the wage is changing, or the choice about expenditures on tobacco for people who actually are smokers. Sometimes, this is what we want
\dfrac {\partial E(y|\mathbf x, y>0)}{\partial x_j} = \ ?
- This is the effect on expected actual choices y, but for all actual values, independently of whether they are censored or not. Most of the time, this is what we want
\dfrac {\partial E(y|\mathbf x, y≥0)}{\partial x_j} = \dfrac {\partial E(y|\mathbf x)}{\partial x_j}= \ ?
Hence, typically, it is the last partial effect we are interested in. For instance, if the wage or the number of children changes, we want to know the effect on the supplied hours worked, coming from already employed people as well as from people who are currently not working.
- Therefore, the last partial effect also incorporates the effect on the probability for doing something or not (e.g., working, smoking, driving with a car, etc.)
\dfrac {\partial P(y_i>0|\mathbf x_i)}{\partial x_j} \, = \, \dfrac {\partial \Phi(\mathbf x_i \boldsymbol \beta/\sigma)}{\partial x_j} \, = \, \dfrac {\beta_j}{\sigma} \phi(\mathbf x_i \boldsymbol \beta/\sigma) \tag{10.17}
We already calculated the expectation E(y_i| \mathbf x_i, y_i≥0), (Equation 10.16). But here, we are dealing with the case of c_i=0, so the third term in Equation 10.16 is zero
E(y_i| \mathbf x_i, y_i≥0) \ = \ E(y_i| \mathbf x_i) \ = \ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \ =
\Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \left[ \mathbf x_i \boldsymbol \beta \, + \, \sigma \, \lambda \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right]
- The surprisingly simple partial effect is
\dfrac {\partial E(y_i|\mathbf x_i, y_i≥0)}{\partial x_j} \ = \ \beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.18}
-
Hence, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor \Phi(\mathbf x_i \boldsymbol \beta / \sigma), which is clearly positive and less than 1
- For calculating the APE, we use the average over all observations of this adjustment factor, i.e.,
\widehat{A P E}_{j} = \hat\beta_{j} \, \frac{1}{n} \sum_{i=1}^{n} \Phi \left(\frac {\mathbf{x}_{i} \hat{\boldsymbol{\beta}}}{\sigma} \right) \tag{10.19}
- For the PEA we use \hat\beta_{j} \, \Phi(\bar{\mathbf x} \hat{\boldsymbol \beta} / \hat\sigma); a short computational sketch follows below
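In R, the APE adjustment factor of Equation 10.19 can be obtained directly from a fitted Tobit model; a sketch, assuming a model estimated with tobit() from the AER package and stored as outtobit (as in the example below):
Code
# Average adjustment factor of Equation 10.19 and the resulting APEs
fac <- mean(pnorm(predict(outtobit) / outtobit$scale))   # (1/n) sum Phi(x_i beta_hat / sigma_hat)
APE <- fac * coef(outtobit)                              # scale the coefficients (ignore the intercept)
APE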
Proof for the partial effect:
- Note, that d \Phi(z) / d z = \phi(z) and d \phi(z) / d z = -z \phi(z)
\dfrac {\partial }{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right ] \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} - \sigma\left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} - \left (\frac{\mathbf x_i \boldsymbol \beta} {1} \right) \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} \ =
\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)
Sometimes, we only want the effect on the choices of already employed people on their hours worked in response to changing wages, or the choice about expenditures on tobacco for people who actually are smokers (the second partial effect mentioned above). In this case we need
E(y_i|\mathbf x_i, y_i>0) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )}
which we already encountered in Equation 10.8, dealing with the truncated normal distribution. Remember, the function \lambda(z) is the inverse Mills ratio (Equation 10.7) and equals \phi(z)/\Phi(z)
- Then, the partial effect is
\dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ = \beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right] \tag{10.20}
- Once again, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ], which can be shown to be strictly between zero and one. For calculating the APE, we use the average over all observations of this adjustment factor. For the PEA we simply calculate the expression at the mean \bar{\mathbf x}
Proof for the partial effect:
Once again note, that d \Phi(z) / dz = \phi(z), \; d \phi(z) / d z = -z \phi(z) \, and \, \lambda(z)=\phi(z)/\Phi(z)
\text{ } \\ \dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ =
\beta_j + \dfrac{- \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \, \Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) - \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma}\phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } =
\beta_j + \left[ - \dfrac{ \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \frac {\beta_j}{\sigma}\Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) } {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } - \dfrac{ \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } \right] \ =
\beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right]
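Analogously, a sketch for the conditional-on-positive adjustment factor of Equation 10.20, again assuming a fitted tobit() model stored as outtobit:
Code
# Average adjustment factor of Equation 10.20 (effect conditional on y > 0)
z    <- predict(outtobit) / outtobit$scale     # x_i beta_hat / sigma_hat
imr  <- dnorm(z) / pnorm(z)                    # inverse Mills ratio, Equation 10.7
fac2 <- mean(1 - z * imr - imr^2)              # strictly between 0 and 1
APE_cond <- fac2 * coef(outtobit)
APE_cond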
10.4.4 Example
Now, we estimate a concrete example using the MROZ data, which we have already studied in the context of Logit and Probit models. We are interested in the number of working hours supplied by married women, hours, and the effects of several explanatory variables
These variables are
- nwifeinc: non-wife income
- educ: education in years
- exper: experience in years
- age
- kidslt6: number of kids younger than 6
- kidsge6: number of kids older than 6
= number of kids older than 6-
We estimate the model with OLS, Tobit and Probit. The reason for the Probit estimation is a special feature of the Tobit model:
The Tobit estimates implicitly incorporate the choice of doing something (for instance, the probability of working, depending on some variables) as well as the response of the intensity of doing something (e.g., the response of working time to education)
These two effects are: {\partial P(y_i>0|\mathbf x_i)}/{\partial x_j} = (\beta_j/\sigma) \phi(\mathbf x_i \boldsymbol \beta/\sigma) (Equation 10.12, Equation 10.17) and {\partial E(y_i|\mathbf x_i, y_i>0)}/{\partial x_j} = \beta_j [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ] (Equation 10.20); hence, both are represented by the same parameter \beta_j times a positive factor
Therefore, every variable in the model affects the participation probability and the intensity equation with the same sign. Further, the Probit parameters should be approximately equal to \hat \beta_j/\hat \sigma in terms of the Tobit estimates. So, it is always a good idea to additionally estimate a corresponding Probit model (coding y_i > 0 as 1) and check whether the estimated coefficients actually have the same signs. If there are coefficients with statistically significantly different signs, the Tobit model might not be appropriate
library(boot)
library(AER)        # provides tobit()
library(wooldridge)
data("mroz")
# OLS
outols <- glm(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, x=TRUE)
# Tobit model - default of `tobit` from `AER` package: left = 0.
# You could also use `censReg` -> same result
outtobit <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
# with censReg - same result but logSigma is reported (but avg_slopes does not work)
#outtobit1 <- censReg(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
# you can calculate the PEA
# margEff(outtobit1)
Code
########## for APEs for modelsummary()
## bootstrap only for std.errors necessary
bootbdy <- function(data, indices) {
dat <- data[indices,] # resampled data
out <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
data = dat)
m_tobit <- avg_slopes(out)
fac <- mean( pnorm( predict(out)/out$scale ), na.rm = TRUE ) # scaling factor, Equation 10.19
m_tobit["estimate"] = m_tobit["estimate"]*fac
return( coef(m_tobit) )
}
bootresult_ <- boot(data=mroz, statistic=bootbdy, R=500)
se_boot <- matrixStats::colSds(bootresult_$t) # for debugging, should be the same as mm
mm = summary(bootresult_) # to calculate std.errors
## to use `modelsummary`, making a data.frame (with naming according broom package)
m_tobit1 = data.frame(term=names(bootresult_$t0), estimate=mm$original, std.error=mm$bootSE, statistic=mm$original/mm$bootSE, p.value= (1 - pnorm( abs(mm$original/mm$bootSE) ))*2 )
## modelsummary needs a special format of a list with tidy and glance
m_tobit <- list( tidy = m_tobit1, glance = data.frame(NA) )
class(m_tobit) <- "modelsummary_list"
##############
# Probit model - to check for misspecification
outprobit <- glm( inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
data = mroz, family = binomial(link = "probit"))
# Average partial effects for Probit with 'avg_slopes()'
marg_probit <- avg_slopes(outprobit)
# Average partial effects for Tobit with 'avg_slopes()', but without correction factor fac
marg_tobit <- avg_slopes(outtobit)
Code
rows <- tibble::tribble(~term, ~Bivariate, ~Multivariate, ~ab, ~cd, ~ef,
'Empty row', '-', '-', '-', '-', '-',
'Another empty row', '?', '?', '?', '?', '?')
#attr(rows, 'position') <- c(1, 6)
modelsummary( list("OLS"=outols, "Tobit"=outtobit, "Tobit APE"=m_tobit, "Probit"=outprobit, "Probit APE"=marg_probit),
output="gt",
#statistic = c('std.error', 'statistic', 'p.value'),
#coef_omit = "Inter",
gof_omit = "F|N",
align = "lddddd",
stars = TRUE,
fmt = 3,
add_rows = rows
)
|  | OLS | Tobit | Tobit APE | Probit | Probit APE |
|---|---|---|---|---|---|
| (Intercept) | 1330.482*** | 965.305* |  | 0.270 |  |
|  | (270.785) | (446.436) |  | (0.508) |  |
| nwifeinc | -3.447 | -8.814* | -5.189* | -0.012* | -0.004* |
|  | (2.544) | (4.459) | (2.593) | (0.005) | (0.001) |
| educ | 28.761* | 80.646*** | 47.473*** | 0.131*** | 0.039*** |
|  | (12.955) | (21.583) | (12.649) | (0.025) | (0.007) |
| exper | 65.673*** | 131.564*** | 54.115*** | 0.123*** | 0.026*** |
|  | (9.963) | (17.279) | (4.945) | (0.019) | (0.002) |
| I(exper^2) | -0.700* | -1.864*** |  | -0.002** |  |
|  | (0.325) | (0.538) |  | (0.001) |  |
| age | -30.512*** | -54.405*** | -32.026*** | -0.053*** | -0.016*** |
|  | (4.364) | (7.419) | (4.259) | (0.008) | (0.002) |
| kidslt6 | -442.090*** | -894.022*** | -526.278*** | -0.868*** | -0.261*** |
|  | (58.847) | (111.878) | (73.498) | (0.118) | (0.032) |
| kidsge6 | -32.779 | -16.218 | -9.547 | 0.036 | 0.011 |
|  | (23.176) | (38.641) | (23.421) | (0.044) | (0.013) |
| R2 | 0.266 |  |  |  |  |
| AIC | 12117.1 | 7656.2 |  | 818.6 | 818.6 |
| BIC | 12158.7 | 7697.8 |  | 855.6 | 855.6 |
| Log.Lik. | -6049.534 |  |  | -401.302 | -401.302 |
| RMSE | 746.18 | 953.68 |  | 0.42 | 0.42 |
| Left_censored |  | 325 |  |  |  |
| Un_censored |  | 428 |  |  |  |
| Right_censored |  | 0 |  |  |  |
| Total |  | 753 |  |  |  |
| Scale |  | 1122.02*** |  |  |  |
| Empty row | - | - | - | - | - |
| Another empty row | ? | ? | ? | ? | ? |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
The number of working hours supplied by married women, hours, and the effects of several explanatory variables, estimated with OLS, Tobit, Tobit-APEs (Equation 10.19) and a Probit model for checking the signs of parameters

Code
#### Code for `texreg` procedure - not reported
library(texreg)
# Calculating APEs for tobit model
# Basically with:
# fav <- mean( pnorm( predict(outtobit)/outtobit$scale ), na.rm = TRUE )
# APE <- fav * coef(outtobit)
source('~/Dropbox/R/User Functions/extract.margins.R') # user function for texreg
source('~/Dropbox/R/User Functions/extract.tobitMargins.R') # user function for texreg
# APE for Tobit with bootstrapping for standard errors. User fn tobitAPE is integrated in extract.tobitMargins.R
bootbody <- function(data, indices) {
out <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
data = mroz[indices,])
return( tobitAPE(out) )
}
bootresult <- boot(data=mroz, statistic=bootbody, R=1000)
# preparing for texreg to know tobit_output (append classes: "tobitMarg","APE")
tobit_out <- outtobit
class(tobit_out) <- c("tobitMarg","APE", "tobit", "survreg", "margins")
screenreg( list(outols, outtobit, tobit_out, outprobit, marg_probit ),
custom.model.names = c("OLS", "Tobit", "Tobit APE", "Probit", "Probit APE"),
digits = 3,
bootresults=bootresult, discrete=c(9)
)
Because of the different scaling factors involved, Tobit coefficients are not directly comparable to OLS coefficients. For instance, look at the OLS estimate for \beta_{educ}, which is 28.8, and the Tobit estimate of 80.6. But the latter is the estimated effect on the latent variable, which is not really relevant in this context
To compare Tobit and OLS, one has to compare average partial effects, APEs, Equation 10.18. It turns out that the partial effects of Tobit and OLS differ not in all, but in a number of cases. For instance, the effect of education on E(y|\mathbf x), the expectation of observed hours worked, is estimated to be 47.5
Another difference between Tobit and OLS is that, due to the linearity of the model, OLS assumes constant partial effects, whereas partial effects are nonconstant in Tobit. To demonstrate this, look at Figure 10.8 below. There, the different predictors for hours worked are displayed, along with the observed data points. The most relevant one is the blue line, which represents E(y|\mathbf x). Note how the blue line converges to the red one, the linear predictor \mathbf x \hat {\boldsymbol \beta} of the Tobit model, as the censoring becomes less important. Further note, the OLS predictor approximates E(y|\mathbf x) quite well for some range of \mathbf x
Let us compare the estimates of the Tobit model with the Probit ones. It turns out that every coefficient has the same sign. Further, dividing the Tobit estimate for \beta_{educ} by the Tobit estimate for \sigma (which is 1122) we get 0.072, compared to 0.13 for Probit. Doing the same for \beta_{kidslt6} we get -0.8, compared to -0.87. This is not a huge difference but shows that the initial decision to work is more strongly affected by education or small kids than the decision about the hours worked once in the labor force
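These scaled comparisons can be computed directly from the fitted objects (a sketch, using outtobit and outprobit estimated above):
Code
# Compare Tobit beta_j / sigma_hat with the Probit coefficients (signs and rough magnitudes)
cbind(tobit_scaled = coef(outtobit) / outtobit$scale,
      probit       = coef(outprobit))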
Code
tobitPredict <- function(x, model) { # E(y|x) = Phi(xb/sigma)*xb + sigma*phi(xb/sigma), Equation 10.16 with c = 0
pred <- pnorm( x/model$scale ) * x + model$scale * dnorm( x/model$scale )
return(pred) }
tobitPredict0 <- function(x, model) { # E(y|x, y>0) = xb + sigma*lambda(xb/sigma), Equation 10.8 with c = 0
pred0 <- x + model$scale * dnorm( x/model$scale ) / pnorm( x/model$scale )
return(pred0) }
olsout2 <- lm(mroz$hours ~ outtobit$linear.predictors)
plot( outtobit$linear.predictors, mroz$hours, ylim=c(-1000, 5000), pch = 16, cex=0.9,
col=c("#666666", "#666666", "#666666"), xlab="Linear tobit predictor X*beta",
ylab="Hours and predictors of hours",
main = "Different predictors for hours worked" )
abline(c(0,0),c(1,1), col="red", lwd=2.5)
curve(tobitPredict(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE,
type = "l", col="blue", lwd = 4.5)
abline(olsout2, col="purple", lwd =2)
abline(0,0)
curve(tobitPredict0(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE,
type = "l", col="cornflowerblue", lwd = 3)
text(-2500, 640, "E(y|y>0)")
text(-600, -950, "E(y*)")
text(-1100, 280, "E(y|y≥0) = E(y)")
text(-1950, -690, "OLS")
10.5 Incidental truncation
Until now, we dealt with the problem of truncated and censored data, which is directly linked to the dependent variable y and leads to a sample that is not random any more; this can render OLS estimates inconsistent.
-
Remember, sample selection poses no problem if it is solely linked to exogenous variables or is exogenous to the errors 3
- To formally show this result, define the indicator s_i, which is one if the observation i is in the sample and otherwise is zero. Then we have for s_i=1 (for fully observed observations, including \mathbf x_i)
s_i y_i \, = \, s_i \mathbf x_i \boldsymbol \beta + s_i u_i \ \ \Rightarrow
E(s_i y_i|\mathbf x_i, s_i=1) \, = \, E(s_i \mathbf x_i \boldsymbol \beta| \mathbf x_i, s_i=1) + E(s_i u_i|\mathbf x_i, s_i=1) \ \ \Rightarrow
E(y_i|\mathbf x_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \underbrace {E(u_i|\mathbf x_i, s_i=1)}_{\text {should be } 0} \, = \, \mathbf x_i \boldsymbol \beta
-
So, if the expectation of u_i is not influenced by s, the conditional expectation of y is equal to \mathbf x_i \boldsymbol \beta and thus, \boldsymbol \beta can be consistently estimated by OLS with the available sample
-
The expectation of u_i is not influenced by s if
s is only a function of the exogenous \mathbf x (exogenous sample selection), in which case the condition E(u_i|\mathbf x_i, s_i=1)=0 reduces to the usual condition E(u_i|\mathbf x_i)=0
s is purely random (random sample selection), in which case the condition E(u_i|\mathbf x_i, s)=0 is met
s is a mixture of both cases above
-
Incidental truncation or sample selection is a mechanism where sample selection is only indirectly linked to the dependent variable y or its error u; that means via other variables.
Very often but not always, this is due to self-selection
-
The leading example is the following: We want to estimate a person’s market wage, i.e., what she could command in the labor market. This market wage would depend on her education, experience, age, etc. But if the wage offered by the market is too low, meaning that the market wage is lower than her reservation wage, she will not work at all and her potential market wage is not observed
This is clearly a self-selection mechanism and a first stage decision determines whether the wage for this particular person is observed or not. The sample is not random any more and, in particular, is linked to the unobserved reservation wage
-
People with low potential market wage or high reservation wage might therefore be heavily underrepresented in the sample
- And if the factors deciding labor force participation influence the potential market wage as well we will get biased results regarding the factors determining the market wage
A solution to this problem is to specify a separate model describing the decision to work versus not to work, to assess whether a particular person is likely to work, and to adjust for the underrepresented cases
Hence, we will have two models: a selection or participation model and the outcome model (sometimes called intensity model), which we are primarily interested in
10.5.1 Heckit model
The selection model (participation model) can be modeled as a simple Probit model; we want to estimate under which conditions a person is expected to work (or, more generally, to participate). This is the case when the probability for working is larger than for non-working. Naturally, this leads to a Probit model:
s_i^*=\mathbf z_i \boldsymbol \gamma + e_i, \ \ \ \ s_i = \begin{cases} 1 \ \text{ if } \ s_i^*>0 \\ 0 \ \text{ if } \ s_i^* ≤ 0 \end{cases} \tag{10.21}
Here, s_i^* is a latent variable, which in our example could be the difference between potential market wage and reservation wage
We predict labor market participation, s_i=1, if s_i^*>0, i.e., the market wage is larger than the reservation wage. Clearly, we predict out of labor force, s_i=0, if s_i^*≤0. In particular, s_i=1 if
e_i > - \mathbf z_i \boldsymbol \gamma \tag{10.22}
- The vector \mathbf z_i contains the explanatory variables for participation in the labor market (in our example the number of kids, income of husband, age, etc.), \boldsymbol \gamma is the corresponding parameter vector to be estimated and e_i represents unobserved heterogeneity of the utility functions and especially, information about the unobserved reservation wage
The equation of primary interest is the outcome equation (sometimes called intensity equation), in our example the wage equation:
y_i^* = \mathbf x_i \boldsymbol \beta + u_i, \ \ \ y_i = \begin{cases} y_i^* \ \text{ if } \ s_i=1 \\ \text{unobserved if } \ s_i = 0 \end{cases} \tag{10.23}
Here, y_i^* is the latent dependent variable, in our example the potential market wage, \mathbf x_i are the explanatory variables and u_i represents unobserved heterogeneity of the individuals which too affects the market wage
The observation rule says that y_i is only observed if s_i=1, i.e. s_i^*>0, which means participation in the labor market in our example. The value of s_i is determined in the first stage decision problem, the selection equation
Note, we assume that \mathbf z_i is observed for all i in the sample, as otherwise we couldn’t estimate the selection problem
According to the selection model (Equation 10.21), besides \mathbf z_i, the unobserved error e_i determines whether the market wage for person i is observed or not. Therefore, we have an incidental sample selection because these factors are not directly linked to y_i or u_i
We assume that the errors from the selection and outcome equations, e and u, are jointly normally distributed with correlation \rho. This means that the joint distribution is solely described by the mean vector and the covariance matrix and thus rules out any nonlinear relationships between e and u:
\binom{e}{u} \ \sim \ N\left ( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \, \sigma_u \\ \rho \, \sigma_u & \sigma_u^2 \end{bmatrix} \right ) \tag{10.24}
Note, \sigma_e^2 = 1 is no restriction of generality as this variance is not identified in a Probit model and thus can be arbitrarily normalized (we already discussed this point in the chapter on binary choice models)
We assume, that e_i as well as u_i are independent of (\mathbf x_i, \mathbf z_i), thus E(u_i | \mathbf x_i, \mathbf z_i)=0 and E(e_i | \mathbf x_i, \mathbf z_i)=0, MLR.4 (see Section 2.7)
Whether or not OLS applied to the wage equation leads to consistent and asymptotically normally distributed estimates of \boldsymbol \beta depends on whether the selection mechanism via e_i is exogenous (uncorrelated) with respect to the error term u_i (and thus with y_i)
Therefore, the correlation coefficient \rho is crucial for this problem. If \rho \neq 0, a sample selection bias arises
To show the importance of \rho we decompose u_i into its conditional expectation on e_i and an error term \epsilon_i, which is uncorrelated with e_i by construction and uncorrelated with \mathbf x_i, \mathbf z_i by assumption (as u_i is independent of (\mathbf x_i, \mathbf z_i))
u_i \, = \, E(u_i|e_i) + \epsilon_i, \quad \text{with } \, E(\epsilon_i|\mathbf x_i, \mathbf z_i, e_i)=0
-
Because (u_i,e_i) are jointly normal, E(u_i|e_i) is equal to the linear projection (regression) of u_i on e_i, with the projection coefficient \sigma_{u,e}/\sigma_e^2 (with \sigma_{u,e} = covariance of u and e)
- This reduces to \sigma_{u,e}, as \sigma_e^2 is normalized to one. Considering the definition of the correlation coefficient, \rho := \sigma_{u,e}/(\sigma_u \sigma_e), we get
u_i \, = \, \sigma_{u,e}e_i + \epsilon_i \, = \, \rho \, \sigma_u e_i + \epsilon_i
- Therefore, the outcome Equation 10.23 can be rewritten as
y_i^* = \mathbf x_i \boldsymbol \beta + \rho \, \sigma_u e_i + \epsilon_i \tag{10.25}
- Taking the conditional expectation of Equation 10.25 for the observed sample (s_i=1 \ \Rightarrow \ e_i>-\mathbf z_i \boldsymbol \gamma \,) we get (note, e_i \perp (\mathbf x_i, \mathbf z_i) by assumption)
E(y_i^*|\mathbf x_i, \mathbf z_i, s_i=1) \ = \ \mathbf x_i \boldsymbol \beta \, + \, \rho \sigma_u E(e_i|s_i=1) \, + \, \underbrace {E(\epsilon_i|\mathbf x_i, \mathbf z_i, \overbrace {s_i=1}^{e_i>-\mathbf z_i \boldsymbol \gamma})}_{= \,0, \text{ as } \epsilon_i \, \perp \, (\mathbf x_i, \mathbf z_i,e_i) }
According to the selection model (Equation 10.21), the term before the last is:
\rho \sigma_u E(e_i|s_i=1) \, = \, \rho \sigma_u E(e_i|s_i^*>0) \, = \, \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma)
-
But this is the expectation of a truncated standardized normal distribution and hence, we can apply the theorem about truncated normal distributions, Theorem 10.1 (with y=e, \mu=0, c=-\mathbf z_i \boldsymbol \gamma and \sigma_{(e)}=1)
\rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma) \, = \, \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma)
with \lambda(\mathbf z_i \boldsymbol \gamma) being the inverse Mills Ratio (see definition in Equation 10.7)
Putting things together, we finally arrive to the conditional expectation of observed y_i, which is the population regression in this case:
E(y_i|\mathbf x_i, \mathbf z_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma) \tag{10.26}
- Note, the similarities to Equation 10.8, which is not accidental as this too is a case of a truncated regression problem
Regarding our leading example: if the correlation of unobserved wage determinants and unobserved determinants of the work decision is positive (\rho>0), the women observed working will have higher wages than expected solely from their characteristics (= positive selection). And this effect is represented by \lambda(\mathbf z_i \boldsymbol \gamma), which is strictly positive
- Furthermore, the correction factor \lambda(\mathbf z_i \boldsymbol \gamma) will be higher for observations with a low probability to be in the sample (because of a low \mathbf z_i \boldsymbol \gamma, \ \lambda'< 0), but which are nonetheless in the sample because of a high e_i, which represents unobserved heterogeneity and is possibly correlated with the error u_i
As Equation 10.26 shows, if \rho \neq 0, regressing y_i solely on \mathbf x_i leads to a sample selection bias in the form of an omitted variable bias; the omitted variable \lambda(\mathbf z_i \boldsymbol \gamma ) is a nonlinear function of \mathbf z_i and therefore probably correlated with \mathbf x_i (as \mathbf x_i is usually a part of \mathbf z_i)
- In this case, there are unobserved factors affecting both the selection as well as the outcome equation. This is the cause of this type of sample selection bias
Finally, let us discuss another aspect more closely by means of our example. We argued that s_i^* in the selection equation could be interpreted as the difference between market wage offers and the reservation wage, (w_i - w_i^r). But according to the outcome Equation 10.23, the market wage offers depend on \mathbf x_i and on u_i. This implies two things
Thus, the variables in the outcome equation, \mathbf x_i, obviously play some role in the selection equation (via market wage offers). Therefore, the variables \mathbf x_i should be part of the variables in \mathbf z_i, the variables of the selection equation. This inclusion of all \mathbf x in \mathbf z is generally recommended as standard procedure
A very similar argument can be put forward for the unobserved factors of the outcome equation, u_i. These too influence the market wage offers and therefore the decision to work or not. Thus, these factors are also a part of e_i, the unobserved factors of the selection equation. Hence, the error terms u_i and e_i are expected to be positively correlated in our example, which implies \rho > 0
Estimation
The model introduced above is called the Heckman model or Heckit, Heckman (1979). It is a generalization of the Tobit model. In fact, if you set \mathbf x_i = \mathbf z_i, u_i = e_i and censor y_i to zero for the cases not observing y_i, we arrive at the Tobit model. For this reason the Heckman model is also sometimes called the Tobit II model.
-
There are two procedures to estimate the model, a 2-step procedure and maximum likelihood estimation. Surprisingly, the 2-step procedure is much more common in practice. This procedure is as follows
- Estimate the selection model (Equation 10.21) by probit. This determines if the dependent variable is expected to be observed. The probability for y_i to be observed is
P(s_i=1|\mathbf z_i) = \Phi(\mathbf z_i \boldsymbol \gamma) \ \rightarrow \ \hat {\boldsymbol \gamma}
- Calculate the inverse Mills ratio using the estimated probit coefficients for every observation
\hat\lambda({\mathbf z}_i \hat{\boldsymbol \gamma}) = \dfrac {\phi({\mathbf z}_i \hat{\boldsymbol \gamma})}{\Phi({\mathbf z}_i \hat{\boldsymbol \gamma})}
- Estimate the outcome Equation 10.23, augmented by the estimated inverse Mills Ratio as a correction factor, by OLS. This yields consistent estimates of \boldsymbol \beta. The coefficient \alpha is an estimate for \rho \sigma_u of Equation 10.26
y_i \, = \, \mathbf x_i \boldsymbol \beta \, + \, \alpha \, \hat\lambda(\mathbf z_i \hat{\boldsymbol \gamma}) \, + \, error \tag{10.27}
- A complication arises with the calculation of correct standard errors for regression Equation 10.27. Besides an inherent heteroscedasticity of the error, one has to take into account that \hat \lambda is itself estimated with the help of the estimated parameters \hat {\boldsymbol \gamma} of the selection equation, which introduces additional variation and correlations across observations. For details see Green (2017), chapt. 19, p. 994
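A minimal manual sketch of this 2-step procedure (without the corrected standard errors; the heckit() routine used in the example below takes care of those), anticipating the MROZ data of that example:
Code
# Heckman 2-step by hand (step-2 standard errors are NOT corrected here)
library(wooldridge)
data("mroz")
# Step 1: Probit selection equation -> fitted index z_i * gamma_hat and inverse Mills ratio
sel <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
           data = mroz, family = binomial(link = "probit"))
zg  <- predict(sel, type = "link")
mroz$imr <- dnorm(zg) / pnorm(zg)              # lambda(z_i gamma_hat), Equation 10.7
# Step 2: outcome equation for the selected sample, augmented by the inverse Mills ratio
out <- lm(log(wage) ~ educ + exper + I(exper^2) + imr, data = mroz, subset = inlf == 1)
summary(out)                                   # coefficient on imr estimates rho * sigma_u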
The whole procedure sounds a little bit complicated, but there are standard routines for both STATA and R. The corresponding R package is sampleSelection with the function heckit. These routines also calculate the correct standard errors for the 2-step procedure.
The estimated coefficient \hat \alpha of \hat \lambda in Equation 10.27 can be used to test for a sample selection bias. If this parameter is not statically significant different from zero, then there seems to be no sample selection bias as \hat \alpha is an estimate for \rho \sigma_u hence, \hat \alpha = 0 \Rightarrow \rho = 0
However, this whole 2-step procedure and testing for a sample selection should be carried out with great care. The reason is that there is an inherent identification problem; the variable \lambda is a nonlinear function (but over a wide range nearly a linear function) of the variables in \mathbf z. As we learned above, \mathbf x is generally contained in \mathbf z. Thus, we often have a server multicollinearity problem in the estimated outcome Equation 10.27, leading to very imprecise parameter estimates and powerless tests
To mitigate the identification problem, we should always have variables in \mathbf z that are not contained in \mathbf x, which help to identify the coefficients of \mathbf x in Equation 10.27; in this case, \hat \lambda is no longer a nearly linear combination of the variables in \mathbf x
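A quick way to gauge this collinearity problem (a rough diagnostic, reusing the objects from the "by hand" sketch above) is to regress the estimated inverse Mills ratio on the outcome-equation regressors for the selected sample; an R^2 close to one indicates that \hat \lambda adds almost no independent variation.
# How much of the variation in the estimated inverse Mills ratio is explained
# by the outcome regressors alone? Values near 1 signal weak identification
collin <- lm(imr ~ educ + exper + I(exper^2), data = mroz, subset = inlf == 1)
summary(collin)$r.squared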
In practice, it is often hard to find variables that play a role in the selection model but credibly do not play a role in the outcome model. Erroneously omitting variables from the outcome model can lead to severe biases
Example
We argued that \mathbf z should contain at least one variable that is not contained in \mathbf x. Let us discuss this aspect by means of our example, which is estimated below.
We are interested in the market wage offers of married women. We assume that education and experience play an important role in this respect. So this is our outcome equation
To avoid a sample selection problem (we only observe wages of employed women), we apply Heckman’s 2-step procedure. Thus, we additionally specify a selection equation for the decision to work in order to estimate the correction factor \lambda. We argued above that all variables of the outcome equation should be included in the selection equation, because possible wage offers are clearly an important factor for the decision to work as well
-
Therefore, the selection model contains at least education and experience as explanatory variables. However, to better identify the parameters of the outcome equation, we additionally need variables that are important for the decision to work but do not play a role in determining market wage offers
- In this case, we can find such variables: we add non-wife income (essentially the husband’s income), the number of children under six, the number of children six or older, and age as explanatory variables to the selection model. These variables should affect the reservation wage and therefore the decision to work, but not the wage offers in the market
library(sampleSelection)
library(wooldridge)
data("mroz")
# OLS estimation
ols <- lm(log(wage) ~ educ + exper + I(exper^2), data = mroz)
# Heckit 2 Step model (default); the first line is the selection equation,
# the second line the outcome equation
heck <- heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
log(wage) ~ educ + exper + I(exper^2),
data = mroz)
summary(heck)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
15 free parameters (df = 739)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.270077 0.508593 0.531 0.59556
nwifeinc -0.012024 0.004840 -2.484 0.01320 *
educ 0.130905 0.025254 5.183 2.81e-07 ***
exper 0.123348 0.018716 6.590 8.34e-11 ***
I(exper^2) -0.001887 0.000600 -3.145 0.00173 **
age -0.052853 0.008477 -6.235 7.61e-10 ***
kidslt6 -0.868328 0.118522 -7.326 6.21e-13 ***
kidsge6 0.036005 0.043477 0.828 0.40786
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5781032 0.3050062 -1.895 0.05843 .
educ 0.1090655 0.0155230 7.026 4.83e-12 ***
exper 0.0438873 0.0162611 2.699 0.00712 **
I(exper^2) -0.0008591 0.0004389 -1.957 0.05068 .
Multiple R-Squared:0.1569, Adjusted R-Squared:0.149
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.03226 0.13362 0.241 0.809
sigma 0.66363 NA NA NA
rho 0.04861 NA NA NA
--------------------------------------------
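The comparison table below also reports a maximum likelihood fit of the same model. A minimal sketch of how such an estimate can be obtained: the sampleSelection package also provides the function selection(), of which heckit() is essentially the 2-step special case; method = "ml" requests maximum likelihood (the object name heckml is only illustrative).
# Maximum likelihood estimation of the same sample selection model
heckml <- selection(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                    log(wage) ~ educ + exper + I(exper^2),
                    data = mroz, method = "ml")
summary(heckml)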
Code
=========================================================
OLS Heckit 2Stp Heckit MLE
---------------------------------------------------------
(Intercept) -0.5220 ** -0.5781 -0.5527 *
(0.1986) (0.3050) (0.2604)
educ 0.1075 *** 0.1091 *** 0.1084 ***
(0.0141) (0.0155) (0.0149)
exper 0.0416 ** 0.0439 ** 0.0428 **
(0.0132) (0.0163) (0.0149)
exper^2 -0.0008 * -0.0009 -0.0008 *
(0.0004) (0.0004) (0.0004)
invMillsRatio 0.0323
(0.1336)
sigma 0.6636 0.6634 ***
(0.0227)
rho 0.0486 0.0266
(0.1471)
---------------------------------------------------------
Adj. R^2 0.1509 0.1490
Num. obs. 428 753 753
Censored 325 325
Observed 428 428
AIC 1693.7702
BIC 1758.5071
Log Likelihood -832.8851
=========================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
In our example, the coefficient of \hat \lambda, the inverse Mills ratio, is not significant. Thus, there is no evidence of a sample selection bias, which is why the OLS and Heckit estimates are very similar
In the table, the results of a maximum likelihood estimation are also reported. As mentioned above, the 2-step procedure is much more common in empirical work, although MLE is more efficient, especially with regard to the estimate of \rho. The downside of MLE is that it is said to be more prone to specification errors, for instance with regard to the normality assumption on the error terms
-
The likelihood function of the Heckman model consists of two parts: P(s_i=0), which is the probit part, and the density of y_i given s_i=1, multiplied by P(s_i=1)
\ln L = \sum_{s_i=0} \ln (1 - \Phi(\mathbf z_i \boldsymbol \gamma)) + \sum_{s_i=1} \ln f(y_i|s_i=1) \Phi(\mathbf z_i \boldsymbol \gamma)
As we have an incidental truncation, f(y_i|s_i=1) = f(y_i|e_i>-\mathbf z_i \boldsymbol \gamma) is a more complicated term based on a bivariate truncated normal distribution (see Green (2017), chapt. 19, p. 995)
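Written out explicitly, in the parameterization commonly found in the literature (see, e.g., Green (2017)), the log likelihood becomes
\ln L \, = \, \sum_{s_i=0} \ln \big(1 - \Phi(\mathbf z_i \boldsymbol \gamma)\big) \, + \, \sum_{s_i=1} \left[ \ln \frac{1}{\sigma_u}\, \phi\!\left(\frac{y_i - \mathbf x_i \boldsymbol \beta}{\sigma_u}\right) + \ln \Phi\!\left(\frac{\mathbf z_i \boldsymbol \gamma + \rho \,(y_i - \mathbf x_i \boldsymbol \beta)/\sigma_u}{\sqrt{1-\rho^2}}\right)\right]
For \rho = 0, the two parts decouple into a probit likelihood for the selection equation and a standard normal regression likelihood for the outcome equation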
Last but not least, it should be noted that many of the methods for limited dependent variables and sample selection discussed here can be extended to 2SLS settings and some also to panel data (for details see Green (2017))
For a truncation from the right (from above) we have: \lambda(x) =\dfrac {-\phi(x)}{\Phi(-x)}.↩︎
Hint: Let \alpha = (\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma}). Then \lambda'(\alpha)=\frac {-\alpha \phi(\alpha) \Phi(\alpha) - \phi(\alpha)^2}{\Phi(\alpha)^2} = -\alpha \lambda(\alpha) - \lambda(\alpha)^2 < 0, and one can show that -1<\lambda'(\alpha)<0.↩︎
The following results carry over to IV-estimations as well, if E(u|\mathbf z, s)=0, with \mathbf z being the (external + internal) instruments.↩︎