Mainly based on Wooldridge (2019), Chapter 17 and Green (2017)

Limited dependent variables (LDV) are variables whose range is substantially restricted, for instance:
  • Binary variables, e.g., employed/not employed

  • Non-negative variables, e.g., wages, prices, interest rates

    • Non-negative variables with excess zeros, e.g., labor supply
  • Count variables, e.g., the number of arrests in a year

  • Censored variables, e.g., unemployment duration

We further have

Sample selection models
  • The sample used to infer population relationships is endogenously selected, e.g., in a wage offer regression, wages are only observable for people who are actually working

    • The resulting sample is no longer random, which can lead to biased estimates

10.1 Binary dependent variable

10.1.1 The linear probability model (LPM)

Until now, we only analysed models with dummy variables on the right hand side – as explanatory variables.

Now, we analyze models with a binary dependent variable, i.e., a dummy variable as left hand side variable.

  • For instance, y may represent labor force participation of married women, which equals one if the woman is in the labor force and zero if not. If we assume a linear model we have

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u

Under MLR.4, we have as usual (population regression line):

\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta

As the dependent variable only takes on the values 0 and 1 we additionally have in this case:

E(y \mid \mathbf x) := \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x)

\Rightarrow \ P(y=1 \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \tag{10.1}

This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.

  • The coefficient \beta_j describes the partial effect of the explanatory variable j on the probability that y=1

\beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}

  • This model was already studied in some detail in Section 5.8; see there, in particular, the estimated example

Code
# LPM for labor force participation of married women
outm <- lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
Code
# We calculate squared residuals and fitted values for plotting
ressq <-  outm$residuals^2
yhat <- outm$fitted.values

Measures of fit for binary dependent variables

  • A popular measure of fit for such models is the hit rate, which measures the portion of correctly predicted observations of the dependent variable (which has only zeros or ones)

    • Thereby, whenever \hat y_i (which is the estimated probability that y_i=1) is larger than the cutoff 0.5, the model predicts y_i=1 and when \hat y_i <0.5, the model predicts y_i=0
  • Besides the overall hit rate, one can separately examine the hit rates for observations with y_i=1 and with y_i=0

    • The portion of correctly predicted y_i's when y_i=1 is called sensitivity. This is the ability of the model to detect true “positives” (e.g., the percentage of people predicted to be in the workforce who actually are or, as an example from medicine, the percentage of sick people who are correctly identified by a tumor marker as having the condition)

    • The portion of correctly predicted y_i's when y_i=0 is called specificity. This is the ability of the model to detect true “negatives” (e.g., the percentage of people predicted not to be in the workforce who actually are not or, as an example from medicine, the percentage of healthy people who are correctly identified by a tumor marker as not having the condition)
  • Below, we calculate the hit rates described above for the LPM and, for purposes of comparison, for a corresponding Probit model (discussed below). The results are nearly identical (the code for this hit-rate function is listed below and can be used by any user)


Code
hitrate <- function(outm, cutoff=0.5, stragaz=TRUE) {
  
  # This function calculates overall hit rates, sensitivity and specificity for
  # LPM, Logit and Probit models
  # outm = output list of `lm()` or `glm()`
  # cutoff = threshold value for predicting 1, default = 0.5
  
  #requires package stargazer
  
  
  # load dependent variable of model
  y <- outm$model[[1]]

  # load fitted values
  yhat <- outm$fitted.values
  ly <- length(y)
  
  # prediction of one, if yhat > threshold
  ypred <- (yhat > cutoff)*1
  
  
  # y which are 1
  y_1 <- y[y==1]
  ly_1 <- length(y_1)

  # y which are zero
  y_0 <- y[y==0]
  ly_0 <- length(y_0)
  
  # overall hits (correct predictions)
  hit_o <- ifelse((ypred == 1 & y == 1) | (ypred == 0 & y == 0), 1, 0)
  hitrate_o <- sum(hit_o)/ly
  
  # correct predictions if y=1 (sensitivity)
  hit_1 <- ifelse((ypred == 1 & y == 1), 1, 0)
  hitrate_1 <- sum(hit_1)/ly_1
  
  # correct predictions if y=0 (specificity)
  hit_0 <- ifelse((ypred == 0 & y == 0), 1, 0)
  hitrate_0 <- sum(hit_0)/ly_0
  
  # storing results in data frame
  results <- data.frame(HitRate = hitrate_o, Sensitivity = hitrate_1, Specificity = hitrate_0, 
                        Nof1s = ly_1, Nof0s = ly_0, check.rows=TRUE)
  
  if (stragaz==TRUE) {
    stargazer::stargazer(results, type = "text", summary = FALSE, rownames = FALSE)
  }
  
  return(results)
}

# Hit rate of LPM
hr_lpm <- hitrate(outm)

===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734      0.818       0.625     428   325 
-------------------------------------------
# For comparison, estimate a corresponding Probit model 
# which we introduce below 

outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                 data = mroz, family = binomial(link = "probit"))

# Hit rate of Probit model
hr_probit <- hitrate(outprobit)

===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734      0.813       0.631     428   325 
-------------------------------------------

10.1.2 Probit and Logit models

The main problems of the Linear Probability Model (due to the linearity in \mathbf x) are

  • Predictions sometimes lie outside the unit interval
  • Partial effects of explanatory variables are constant

An obvious solution for these problems is a nonlinear model for binary response

  • We assume that the expected response – the response probability – is a nonlinear continuous function G(\cdot) of the explanatory variables, with G'>0, G(z) \rightarrow 0 for z \rightarrow -\infty and G(z) \rightarrow 1 for z \rightarrow \infty

E(y|\mathbf x) \, =\, P(y=1 | \mathbf{x}) \, = \, G\left(\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}\right) \, = \, G(\mathbf{x} \boldsymbol{\beta}) \

  • G(\mathbf x) is the (inverse) link function which links the variables \mathbf x to the probability that y_i = 1

  • Obvious candidates for the function G(\cdot) are cumulative distribution functions (cdf), which are increasing and map the real line into the unit interval [0,1]

The estimation model for the observable variable y therefore is

y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u


  • The error term u of this model has the same properties as the one in the LPM, especially the errors are heteroscedastic with \operatorname{Var}(u_i | \mathbf x) = G(\mathbf x \boldsymbol \beta) (1 - G(\mathbf x \boldsymbol \beta)), (see Section 5.8)

  • Conditional on \mathbf x, the observations y_i are Bernoulli distributed

  • Regarding the choice of the link function there are two main models

    • Probit model with the (inverse) link function (with \phi(z) being the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution, see Equation A.3 and Equation A.4):

    G(z) \ = \ \Phi(z) := \, \int_{-\infty}^z \phi(v)dv \ = \, \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ \ \quad \text{ :cdf of standard normal distribution}

    • Logit model with the (inverse) link function:

    G(z) \ = \ \Lambda(z) := \, \dfrac {e^z}{1 + e^z} \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \ \ \ \quad \text{ :cdf of standard Logistic distribution}

  • Both distributions are symmetric and have quite similar shapes. However, the Logistic distribution has fatter tails – it is nearly identical to a t(8)-distribution. Outliers are therefore somewhat more probable, rendering the Logit estimates a little more robust against outliers
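
The similarity of the two cdfs can be visualized directly with base R; a minimal sketch (the factor \pi/\sqrt 3 rescales the standard Logistic to unit variance):

Code
# Compare the Probit and Logit (inverse) link functions
z <- seq(-4, 4, by = 0.05)
plot(z, pnorm(z), type = "l", col = "blue", ylab = "G(z)",
     main = "Standard normal vs. logistic cdf")
lines(z, plogis(z), col = "darkred", lty = 2)                  # standard Logistic, Var = pi^2/3
lines(z, plogis(z * pi / sqrt(3)), col = "darkgreen", lty = 3) # Logistic rescaled to unit variance
legend("topleft", c("Probit (pnorm)", "Logit (plogis)", "Logit, rescaled"),
       col = c("blue", "darkred", "darkgreen"), lty = 1:3, cex = 0.8)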


Latent variable interpretation

  • Both the Logit and Probit model have a latent variable formulation/interpretation

y^* = \mathbf x \boldsymbol \beta + e, \quad \text{ and } \ \ y_i = \begin{cases} 1 \ \text{ if } \ y_i^*>0 \\ 0 \ \text{ if } \ y_i^*\leq 0 \end{cases}

  • e \sim N(0,1) with the Probit model; and e \sim Logistic(0,\pi^2/3) with the Logit model

  • If the latent variable y^* is larger than zero, the observable y takes on the value 1; if it is less than or equal to zero, y takes on 0 (y^* can thus be interpreted as the propensity to have y = 1)

    • The latent variable is often interpreted as the net utility of doing something – Random Utility models. If this net utility is higher than some threshold (0 in our case), then the utility is higher than the opportunity costs (the utility of other alternatives) and the action is taken, i.e., y=1
  • Note that there is rarely a well-defined unit of measurement for y^*. Therefore, identification requires a scaling, which is done by setting the variance of e to one (Probit), respectively to \pi^2/3 (Logit)

  • We do not observe the utility of the action – y^* is a latent variable – we only observe whether the action is taken or not, y=1 or y=0, i.e., whether y^* > 0 or y^* \leq 0. In the following it is shown that this model of discrete choice leads to our estimation model


  • From the latent variable model it follows:

E(y \,| \, \mathbf x) \ = \ P(y = 1\,| \,\mathbf x) \ = \ P(y^* > 0 \,| \,\mathbf x) \ = \ P(\mathbf x \boldsymbol \beta + e > 0\,| \,\mathbf x) \ = \ P(e > -\mathbf x \boldsymbol \beta \,| \,\mathbf x) \ \ =

1 - G(-\mathbf x \boldsymbol \beta) \ \ = \text { (by symmetry) } \

G(\mathbf x \boldsymbol \beta) \ = \begin{cases} \Phi(\mathbf x \boldsymbol \beta) \ \text{ :Probit model } \\ \Lambda(\mathbf x \boldsymbol \beta) \ \text{ :Logit model } \end{cases}

  • Hence, the latent variable model serves as a theoretical foundation of our nonlinear estimation model, postulated at the beginning of this section. We have shown that:

E(y \,| \, \mathbf x) \, = \, G(\mathbf x \boldsymbol \beta)

  • This leads to the estimation model

y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u

  • Note, neither the threshold of 0 nor \operatorname{Var}(e) = 1 (respectively \pi^2/3) in the latent variable model is a restriction of generality. The former is absorbed by the intercept, the latter only amounts to a rescaling of the coefficients. So, neither the threshold nor \operatorname{Var}(e) is of particular interest; they are not identified and hence free to choose
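
To make the latent variable formulation concrete, here is a small simulation sketch (sample size and true parameter values are chosen purely for illustration): the latent y^* is generated with standard normal errors, only its sign is observed, and a Probit regression approximately recovers the true coefficients.

Code
# Simulation sketch of the latent variable model (illustrative values)
set.seed(123)
n <- 5000
x <- rnorm(n)
ystar <- -0.5 + 1.0 * x + rnorm(n)    # latent variable with e ~ N(0,1)
y <- as.numeric(ystar > 0)            # we only observe the binary outcome
coef(glm(y ~ x, family = binomial(link = "probit")))   # should be close to (-0.5, 1.0)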

10.1.3 The Gumbel model

  • Sometimes, a non-symmetric distribution seems more appropriate, for instance if the numbers of ones (successes) and zeros (failures) are very different in the sample. In this case the Gumbel distribution (an extreme value distribution) can be used as an interesting alternative link function: G(z) \ =\, {e^{-e^{-z}}} \ = {e^{-e^{-\mathbf x \boldsymbol \beta}}} \ \ \ \ \quad \text{ :cdf of standard Gumbel distribution}

  • This is a right skewed distribution, which might be appropriate if the ones are the rare (extreme) outcomes. If the zeros are the rare outcomes one could use a left skewed distribution, the (mirrored) Complementary Gumbel distribution: 1-G(-z) \ =\, 1-e^{-e^{z}} \ = 1-e^{-e^{\mathbf x \boldsymbol \beta}} \ \ \ \ \quad \text{ :cdf of Complementary Gumbel distribution}

  • Interestingly, only this version is implemented in the R function glm() (as link = "cloglog"). So, if you want to use this left skewed distribution and the ones are the rare outcomes in your sample, you have to recode the y's with the command “ifelse(dataset$varname_of_y == 1, 0, 1)”, turning ones into zeros and zeros into ones (see the sketch below)
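
A toy sketch of this recoding with simulated data in which the ones are rare (all names and values are made up for illustration); the cloglog link of glm() corresponds to the Complementary Gumbel cdf given above:

Code
# Simulated data with rare ones
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + x))            # roughly 5-10 percent ones
dat <- data.frame(y = y, x = x)

# Recode so that the rare ones become zeros (and vice versa) ...
dat$y_rev <- ifelse(dat$y == 1, 0, 1)

# ... and estimate with the Complementary Gumbel (cloglog) link implemented in glm()
outgumbel <- glm(y_rev ~ x, data = dat, family = binomial(link = "cloglog"))
summary(outgumbel)
# note: relative to a model for the original y, the signs of the coefficients flip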


10.1.4 LPM versus Probit

To show graphically the difference between the LPM and the Probit (or Logit) model, we estimate the probability of denial of a bank credit (deny) as a function of the ratio of monthly debt payments to monthly income (pirat)

library(visreg)  # for plotting partial effect
library(wooldridge)
library(AER)  # the HMDA data set (deny, pirat) is provided by the AER package
data("HMDA")

# convert `deny` to numeric. This is coded as a factor variable in the data set 
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimating linear probability model
denylpm <- lm(deny ~ pirat, data = HMDA)

To estimate the Probit model, we use the glm() function in R – glm stands for generalized linear models. There are two additional options to set in this function:

  • family for the distribution of y; this is binomial in our case (binary dependent variable)

  • and you have to specify the link function G(\mathbf x \boldsymbol \beta) for the probit model

# estimating probit model
denyprobit <- glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA)

Code
visreg(denylpm, band = FALSE, ylim=c(-0.05, 1.05), main = "LPM", xlab = "P/I ratio")
abline(0,0, lty = 2, col = "darkred");  abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied");  text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)

Figure 10.1: Estimated probability of denying a bank credit with a LPM

Code
visreg(denyprobit, scale = "response", band = FALSE, ylim=c(-0.05, 1.05), main = "Probit", xlab = "P/I ratio", rug=0)
abline(0,0, lty = 2, col = "darkred");  abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied");  text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)

Figure 10.2: Estimated probability of denying a bank credit with a Probit model

10.1.5 Partial Effects

  • The two previous figures show that the predicted probabilities in the Probit model are always within the unit interval, contrary to the LPM

  • Furthermore, the partial effects of the explanatory variables are not constant any more (as in the LPM), but are nonlinear and depend on the values of \mathbf x

    • On the one hand, this is a big theoretical advantage, as constant (linear) partial effects are not plausible when predicting probabilities

    • On the other hand, this is a practical disadvantage, as the estimated coefficients \hat\beta are not easily interpretable any more. Basically, one has to rely on average partial effects, APE.

  • The partial effect of the explanatory variable x_j on the probability that y=1 is

\dfrac{\partial P(y=1| \mathbf x)}{\partial x_j} \ = \ \dfrac {\partial G(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \begin{cases} \dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j \ \text{ : Probit } \\ \dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda(\mathbf x \boldsymbol \beta) (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j \ \text{ : Logit } \end{cases}


  • For the probit case, \phi is the density function of the normal distribution and we have

\dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \, \dfrac {\partial}{\partial x_j} \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j

  • For the logit case, applying the Quotient rule, we have

\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \dfrac {\partial}{\partial x_j} \left( \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \right) \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{ (1 + e^{\mathbf x \boldsymbol \beta})^2} \, \beta_j

After some manipulations, we eventually get

\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda'(\mathbf x \boldsymbol \beta) \, \beta_j \ = \ \Lambda(\mathbf x \boldsymbol \beta) \, (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j

  • In both cases the partial effect of x_j on the P(y=1 | \mathbf x) is not equal to \beta_j (which is partial effect on the latent variable y^*), but \beta_j must be multiplied by the corresponding density function, evaluated at \, \mathbf x \boldsymbol \beta, i.e., at particular values of \mathbf x

  • Hence, the partial effects are nonlinear and depend on the level of \mathbf x

    • Remark: The formulas for the partial effects just derived are only valid for continuous or nearly continuous variables

  • One can calculate APE (average partial effects) for instance with the R package marginaleffects, or PEA (partial effects at average)

    • The former is the partial effect calculated for each observation \mathbf x_i and then averaged (see the sketch at the end of this subsection). For the Probit model this is: \widehat{A P E}_{j} =\left( \frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i} \hat{\boldsymbol{\beta}}\right) \right) \hat\beta_{j}

    • The latter is calculated at the mean of \mathbf x, \bar {\mathbf x}. For the Logit model this is \, \widehat{P E A_{j}} = \Lambda' (\bar{\mathbf{x}} \hat{\boldsymbol{\beta}}) \hat\beta_{j}

    • PEAs could be somewhat problematic in the case of dummy variables. For instance, if you have a dummy variable for gender in your model, averaging to 0.5 might be questionable

  • For a discrete explanatory variable x_k, e.g., the number of years of education or gender, the APE can directly be computed (for the Logit model replace \Phi by \Lambda)

\widehat{A P E}_{k} \ = \ n^{-1} \sum_{i=1}^n \left( \Phi\left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k}\left(c_{ik}+1\right) \right] \ - \ \Phi \left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k} c_{ik} \right] \right)

The R package marginaleffects applies this formula automatically to factor variables
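
As an illustration, the Probit APE formula can be computed by hand for the labor force participation model estimated above (object outprobit) and compared with the output of avg_slopes(); a sketch for the (roughly continuous) variable nwifeinc, assuming the marginaleffects package is installed:

Code
library(marginaleffects)

X  <- model.matrix(outprobit)          # regressor matrix including the intercept
b  <- coef(outprobit)
xb <- as.vector(X %*% b)               # linear index x_i * beta-hat

# APE of nwifeinc: average of phi(x_i * beta-hat), multiplied by the coefficient
mean(dnorm(xb)) * b["nwifeinc"]

# should be (approximately) the same as the result of marginaleffects
avg_slopes(outprobit, variables = "nwifeinc")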


10.1.6 Estimation

  • As both models are highly nonlinear, OLS cannot be applied. In principle, NLS (nonlinear least squares) could be applied, but because of the heteroscedasticity of the error terms this would not be efficient

  • Maximum Likelihood estimation (MLE) is the main procedure

    • Thereby, we are looking for coefficients \boldsymbol \beta so that the observed sample has the highest probability

    • Each observation i is treated as a single and independent draw from a Bernoulli distribution (binomial with one draw)

    • We have the success probability for observation i (remember: G=\Phi for the Probit and G=\Lambda for the Logit model): P(y_i=1 \, | \, \mathbf x_i; \boldsymbol \beta) = G(\mathbf x_i \boldsymbol \beta)

    • and the non-success probability for observation i:
      P(y_i=0 \, | \, \mathbf x_i; \boldsymbol \beta) = 1-G(\mathbf x_i \boldsymbol \beta)

  • This leads to the joint probability function (for all observations)

P(y_1, \ldots, y_n \, | \, \mathbf x; \boldsymbol \beta) \ = \ \underset {y=0} {\prod_{i=1}^n} \left[ 1-G(\mathbf x_i \boldsymbol \beta) \right] \, \underset {y=1} {\prod_{i=1}^n} G(\mathbf x_i \boldsymbol \beta)


  • This can be conveniently rewritten to the likelihood function of the sample for the n observations

L(\boldsymbol \beta \ | \ \text{data}) \ = \ \prod_{i=1}^n \underbrace {\left[ G(\mathbf x_i \boldsymbol \beta) \right]^{y_i} \, \left[1-G(\mathbf x_i \boldsymbol \beta) \right]^{1-y_i}}_{= f(y_i | \mathbf x, \boldsymbol \beta), \, \text{ conditional density of } y_i}

  • (Note, if y_i=0 then 1-y_i=1 and if y_i=1 then 1-y_i=0)

  • Taking logs we get the log-likelihood function

\ln L(\boldsymbol \beta \ | \ \text{data}) \ = \ \sum_{i=1}^n \underbrace {\left\lbrace y_i \ln \left[ G(\mathbf x_i \boldsymbol \beta) \right] \, + \, (1-y_i) \ln \left[1-G(\mathbf x_i \boldsymbol \beta) \right] \right \rbrace}_{\ln f(y_i | \mathbf x, \boldsymbol \beta)}

  • We maximize \ln L(\boldsymbol \beta \, | \, \text{data}) with respect to \boldsymbol \beta such that the observed sample has the highest probability. This leads to the likelihood equations (the counterparts to the normal equations of OLS)

\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n \left[ y_i \dfrac{g(\mathbf x_i \boldsymbol \beta)}{G(\mathbf x_i \boldsymbol \beta)} \ + \ (1-y_i) \dfrac{-g(\mathbf x_i \boldsymbol \beta)}{1-G(\mathbf x_i \boldsymbol \beta)} \right] \mathbf x_i \ = \ \mathbf 0, \quad \text{ with } g := G'


  • After some manipulations, for the Logit case, these likelihood equations reduce to

\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \mathbf x_i \ = \ \mathbf 0 \

  • Noting that [y_i - \Lambda(\mathbf x_i \boldsymbol \beta)] are the residuals of the model, the similarities to the OLS normal equations are obvious (compare Equation 1.9)

  • If \boldsymbol \beta contains an intercept, the first likelihood equation is

    \dfrac { \partial \ln L}{\partial \beta_0} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \ = \ 0 \ \ \ \Rightarrow \ \ \ \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n \Lambda(\mathbf x_i \boldsymbol \beta)

    • Hence, the portion of ones in the sample equals the average of the predicted probabilities
  • Interestingly, the likelihood equations for the Probit case don’t look so familiar, involving \phi/\Phi, the inverse Mills ratio

  • In both cases, the likelihood equations are highly nonlinear and have to be solved numerically, for instance with the Newton-Raphson algorithm
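
To make the estimation procedure concrete, here is a minimal sketch that maximizes the Logit log-likelihood numerically with optim() and compares the result with glm(), reusing the labor force participation specification from above:

Code
library(wooldridge)
data("mroz")

X <- model.matrix(~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
y <- mroz$inlf

# negative log-likelihood of the Logit model (optim() minimizes by default)
negloglik <- function(beta, y, X) {
  xb <- X %*% beta
  # log(G) and log(1-G), computed in a numerically stable way
  -sum(y * plogis(xb, log.p = TRUE) + (1 - y) * plogis(xb, lower.tail = FALSE, log.p = TRUE))
}

opt <- optim(rep(0, ncol(X)), negloglik, y = y, X = X, method = "BFGS", hessian = TRUE)

# compare the hand-coded ML estimates with glm()
cbind(optim = opt$par,
      glm = coef(glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                     data = mroz, family = binomial(link = "logit"))))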


10.1.7 Example

  • We once more examine a model for the probability of labor force participation of married women, inlf (for “in labor force”)

  • The explanatory variables are: nwifeinc (additional family income), educ, exper, exper^2, age, kidslt6 (number of children under 6 years) and kidsge6 (number of children over 6 years)

  • This time, we additionally estimate Logit and Probit models. Furthermore, we use the R package marginaleffects to calculate the implied APE

  • The results presented below are pretty similar for the three models

    • Note, you should only directly compare the APE

    • As shown previously, the coefficients of Logit and Probit models are non-linear transformations of the partial effects

      • The Logit coefficients are typically 1.6 to 1.8 times larger than the Probit coefficients
      • Probit coefficients are typically about 3.3 times larger than the ones of the LPM
    • Fortunately, the R-package marginaleffects is doing the work for us and calculates the appropriate APE

  • All estimated coefficients have the expected sign

    • Once more, of special interest is the variable kidslt6, the number of children under 6 years. The estimated APE is about -0.26, meaning that an additional young child reduces the probability of being in the labor force by about 26 percentage points (on average)
  • But for Logit and Probit models, this partial effect depends in a nonlinear manner on the values of the explanatory variables. This is a main advantage compared to the LPM. The following figures show these nonlinear effects of the number of children under 6 on labor participation


library(wooldridge) 
library(marginaleffects)
library(modelsummary)

data("mroz")

# LPM model (glm() with the default gaussian family yields the OLS estimates)
outlpm <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

#Logit model
outlogit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, 
                family = binomial(link = "logit"))

# Probit model
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, 
                 family = binomial(link = "probit"))

# Average partial effects with avg_slopes() from marginaleffects package
marg_lpm <- avg_slopes(outlpm)
marg_logit <- avg_slopes(outlogit)
marg_probit <- avg_slopes(outprobit)

summary(outlogit)
      
      Call:
      glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age + 
          kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
      
      Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
      (Intercept)  0.425452   0.860365   0.495  0.62095    
      nwifeinc    -0.021345   0.008421  -2.535  0.01126 *  
      educ         0.221170   0.043439   5.091 3.55e-07 ***
      exper        0.205870   0.032057   6.422 1.34e-10 ***
      I(exper^2)  -0.003154   0.001016  -3.104  0.00191 ** 
      age         -0.088024   0.014573  -6.040 1.54e-09 ***
      kidslt6     -1.443354   0.203583  -7.090 1.34e-12 ***
      kidsge6      0.060112   0.074789   0.804  0.42154    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 1029.75  on 752  degrees of freedom
      Residual deviance:  803.53  on 745  degrees of freedom
      AIC: 819.53
      
      Number of Fisher Scoring iterations: 4
summary(outprobit)
      
      Call:
      glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age + 
          kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
      
      Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
      (Intercept)  0.2700736  0.5080782   0.532  0.59503    
      nwifeinc    -0.0120236  0.0049392  -2.434  0.01492 *  
      educ         0.1309040  0.0253987   5.154 2.55e-07 ***
      exper        0.1233472  0.0187587   6.575 4.85e-11 ***
      I(exper^2)  -0.0018871  0.0005999  -3.145  0.00166 ** 
      age         -0.0528524  0.0084624  -6.246 4.22e-10 ***
      kidslt6     -0.8683247  0.1183773  -7.335 2.21e-13 ***
      kidsge6      0.0360056  0.0440303   0.818  0.41350    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 1029.7  on 752  degrees of freedom
      Residual deviance:  802.6  on 745  degrees of freedom
      AIC: 818.6
      
      Number of Fisher Scoring iterations: 4
Code
# Modifications for modelsummary to print hit rates for glm() models
library(broom)
# the additional goodness-of-fit statistics are listed one below the other
glance_custom.glm <- function(x, ...) {
  HRate <- round( hitrate(x, stragaz=FALSE)$HitRate, 2 )
  Sens <-  round( hitrate(x, stragaz=FALSE)$Sensitivity, 2 )
  Spec <-  round( hitrate(x, stragaz=FALSE)$Specificity, 2 )
  Ratio <- round( hitrate(x, stragaz=FALSE)$Nof1s / hitrate(x, stragaz=FALSE)$Nof0s, 2 ) 
  out <- data.frame("HitRate" = HRate, 
                    "Sensitifity" = Sens,
                    "Specificity" = Spec,
                    "Ratio 1/0" = Ratio)
  return(out)
}
Code
modelsummary( list(" LPM "=outlpm, " LPM APE "=marg_lpm, " Logit "=outlogit, 
                   " Logit APE "=marg_logit, " Probit "=outprobit, 
                   " Probit APE "=marg_probit), 
              output="flextable", 
              gof_omit = "F", 
              align = "ldddddd", 
              stars = TRUE, 
              fmt = 4)
Table 10.1: Probability of labor force participation of married women, several different models

====================================================================================
              LPM        LPM APE    Logit      Logit APE  Probit     Probit APE
------------------------------------------------------------------------------------
(Intercept)    0.5855***             0.4255                0.2701   
              (0.1542)              (0.8604)              (0.5081)  
nwifeinc      -0.0034*   -0.0034*   -0.0213*   -0.0038*   -0.0120*   -0.0036*  
              (0.0014)   (0.0014)   (0.0084)   (0.0015)   (0.0049)   (0.0015)  
educ           0.0380***  0.0380***  0.2212***  0.0395***  0.1309***  0.0394***
              (0.0074)   (0.0074)   (0.0434)   (0.0073)   (0.0254)   (0.0073)  
exper          0.0395***  0.0268***  0.2059***  0.0254***  0.1233***  0.0256***
              (0.0057)   (0.0025)   (0.0321)   (0.0022)   (0.0188)   (0.0022)  
I(exper^2)    -0.0006**             -0.0032**             -0.0019** 
              (0.0002)              (0.0010)              (0.0006)  
age           -0.0161*** -0.0161*** -0.0880*** -0.0157*** -0.0529*** -0.0159***
              (0.0025)   (0.0025)   (0.0146)   (0.0024)   (0.0085)   (0.0024)  
kidslt6       -0.2618*** -0.2618*** -1.4434*** -0.2578*** -0.8683*** -0.2612***
              (0.0335)   (0.0335)   (0.2036)   (0.0319)   (0.1184)   (0.0319)  
kidsge6        0.0130     0.0130     0.0601     0.0107     0.0360     0.0108   
              (0.0132)   (0.0132)   (0.0748)   (0.0133)   (0.0440)   (0.0132)  
------------------------------------------------------------------------------------
Num.Obs.       753        753        753        753        753        753      
R2             0.264      0.264                                                
AIC            865.8      865.8      819.5      819.5      818.6      818.6    
BIC            907.4      907.4      856.5      856.5      855.6      855.6    
Log.Lik.      -423.892   -423.892   -401.765   -401.765   -401.302   -401.302  
RMSE           0.42       0.42       0.42       0.42       0.42       0.42     
HitRate        0.73       0.73       0.74       0.74       0.73       0.73     
Sensitivity    0.82       0.82       0.81       0.81       0.81       0.81     
Specificity    0.62       0.62       0.64       0.64       0.63       0.63     
Ratio.1.0      1.32       1.32       1.32       1.32       1.32       1.32     
------------------------------------------------------------------------------------
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


Code
visreg(outlpm, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - LPM")


Code
visreg(outprobit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - Probit")


Code
visreg(outlogit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - Logit")


10.2 Maximum likelihood estimation in general

The starting point for obtaining maximum likelihood estimators (MLE) is the assumption that the conditional distribution of the data – the data generating process – is known, except for a finite number of parameters \boldsymbol \theta which have to be estimated

  • So, we have the probability density function (if y_i is a continuous RV), respectively the probability mass function (if y_i is a discrete RV), for observation i

f(y_i \, | \, \mathbf x_i; \boldsymbol \theta), \ \ \forall i

  • The joint density of n independent and (identically) distributed (i.i.d.) observations for this process is the product of the individual densities

f(y_1, \ldots, y_n \, | \, \mathbf X; \boldsymbol \theta) \ = \ \prod_{i=1}^n f(y_i \, | \, \mathbf x_i; \boldsymbol \theta) \ = \ L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X)

  • The joint density also defines the Likelihood function L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X), which is a function of the unknown parameters \boldsymbol \theta, given the sample data (\mathbf y, \mathbf X)

    • Note the reversed writing of the variables. This should simply indicate that we are interested in finding the parameters given the data (not that \boldsymbol \theta is random)
  • Remark: If the observations are not independent, as is very often the case with time series, the joint density is more complicated and involves the covariance matrix of the observations y_i. For more complicated time series models, the joint density can in many cases be obtained with the so-called prediction-error decomposition of the likelihood, which is calculated with the help of the Kalman filter


  • As it is usually simpler to work with logs, the log-likelihood function is

\ln L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X) \ = \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)

  • We apply the Maximum Likelihood principle: In estimating \boldsymbol \theta, we are looking for parameters \boldsymbol \theta so that the observed sample has the highest probability

\underset{\boldsymbol \theta}{\max} \ \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)

  • The first order conditions of this optimization problem lead to the so called likelihood equations, which are the counterparts to the normal equations of OLS

\dfrac { \partial \ln L}{\partial \boldsymbol \theta} := \ \mathbf g(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \frac {\partial }{\partial \boldsymbol \theta} \, \ln f(y_i | \mathbf x_i, \boldsymbol \theta) \ = \ \mathbf 0

  • The solution of the likelihood equations leads to the estimated parameter vector \hat{\boldsymbol \theta}_{ML}

    • This estimate is identified (estimable) if for any other \boldsymbol \theta \neq \hat{\boldsymbol \theta}_{ML}, we have L(\hat{\boldsymbol \theta}_{ML}) \neq L(\boldsymbol \theta) (uniqueness)

    • In most cases, the optimization has to rely on numerical optimization procedures like the Newton-Raphson algorithm

  • The second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum


10.2.1 Properties of MLE

If the distributional assumption holds, Maximum Likelihood estimators are generally

  • Consistent

    • \operatorname{plim} \hat{\boldsymbol \theta}_{ML} = \boldsymbol \theta_T, \quad with \boldsymbol \theta_T being the true parameter vector

    • But in finite samples, MLE are not generally unbiased
      \text{ }

  • Asymptotically normal with

    \hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace^{-1} \right)

    • Thereby, \mathbf H(\boldsymbol \theta_T) is the Hessian Matrix of the log-likelihood function, evaluated at the true parameter vector \boldsymbol \theta_T:

\operatorname {Asy.Var} (\hat {\boldsymbol \theta}_{ML}) \, = \, \ \left( - E \left[ \mathbf H(\boldsymbol \theta_T) \right] \right) ^{-1} \, = \, \left( -E \left[ \dfrac {\partial^2 \ln L}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right] \right)^{-1} \, = \,

\left( -E \, \left[ \begin{array}{ccc} \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_k} \\ \vdots & \ddots & \vdots \\ \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_k} \end{array} \right] \right)^{-1}

\text{ }

  • Asymptotically efficient in the class of consistent and asymptotically normal estimators

    • The Cramer-Rao Lower Bound is the asymptotic lower bound of the variance of a consistent and asymptotically normal estimator. And this can be shown to be equal to the inverse of the Information Matrix

      • The Information matrix \mathbf I_n(\boldsymbol \theta_T) for \boldsymbol \theta_T is defined as the covariance matrix of the gradient \mathbf g, the score, of the log-likelihood function evaluated at the true parameter vector. \mathbf I_n(\boldsymbol \theta_T) measures the amount of information about \boldsymbol \theta_T contained (on average) in a realization of the data

      \mathbf I_n (\boldsymbol \theta_T) := \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = E \left( \mathbf g(\boldsymbol \theta_T) \, \mathbf g(\boldsymbol \theta_T)' \right),

      \text{with } \ \mathbf g(\boldsymbol \theta) := \dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y), \ \ E[\mathbf g(\boldsymbol \theta_T)] = \mathbf 0

      • Under fairly general conditions, the Information Matrix is equal to the negative expected value of the Hessian Matrix, evaluated at the true parameters. This is the Information matrix equality

      \mathbf I_n (\boldsymbol \theta_T) = \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = -E[\mathbf H(\boldsymbol \theta_T)]

      • The right hand term equals the inverse of the covariance matrix of \hat {\boldsymbol \theta}_{ML}. Hence, the variance of MLE is equal to Cramer-Rao Lower Bound and therefore, the MLE is asymptotically efficient
  • Finite sample property: If the sample is drawn from an exponential family distribution and if there exists an unbiased minimum variance estimator, then it is a MLE


10.2.2 Covariance matrix

  • Sketch of a proof that the MLE is asymptotically normal. We make a first order approximation of the first order conditions and assume that \mathbf y is i.d. (to ease the notation we write \hat{\boldsymbol \theta} instead of \hat{\boldsymbol \theta}_{ML})

\mathbf g (\hat {\boldsymbol \theta}) = \mathbf 0, \ \ \Rightarrow \ \ \mathbf g ( \hat {\boldsymbol \theta}) \approx \mathbf g (\boldsymbol \theta_T) + \underbrace{\mathbf H(\boldsymbol \theta_T)}_{\frac {\partial }{\partial \boldsymbol \theta} \mathbf g(\boldsymbol \theta_T)} (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \mathbf 0 \ \ \Rightarrow

\mathbf g (\boldsymbol \theta_T) = -\mathbf H(\boldsymbol \theta_T)(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \ \Rightarrow

(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [-\mathbf H(\boldsymbol \theta_T)]^{-1} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow

(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1} \frac{1}{n} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow

\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \underbrace {[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1}}_{\stackrel{p}{\longrightarrow} \ \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1}} \ \ \ \ \cdot \underbrace{\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T)}_{\stackrel{d}{\longrightarrow} \ N(\mathbf 0,\, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)])} \ \ \Rightarrow

\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \stackrel{d}{\longrightarrow} \ N \left( \mathbf 0, \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)] \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \right) \ =

N (\mathbf 0, \, \lbrace E[-\frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1})


  • For the last two we invoke the i.d. assumption and the Information matrix equality. In this case

\mathbf g (\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta), \ \ \text{ with } \mathbf g_i(\boldsymbol \theta) := \dfrac {\partial \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta}, \ \ \text{ and } \ \ \mathbf {\bar g} (\boldsymbol \theta) = \frac {1}{n} \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta)

\mathbf H(\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf H_i (\boldsymbol \theta), \ \ \text{ with } \mathbf H_i(\boldsymbol \theta) := \dfrac {\partial^2 \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta \, \partial \boldsymbol \theta'}

  • \sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T) is \sqrt n times the average of random variables \mathbf g_i with an expected value of 0 and covariance matrix \operatorname {Var}\left( \mathbf g_i (\boldsymbol \theta_T) \right) = -E[\mathbf H_i (\boldsymbol \theta_T)] (by the Information matrix equality) and an average covariance matrix of -E[\frac{1}{n} \sum_i \mathbf H_i (\boldsymbol \theta_T)] = -E[\frac {1}{n} \mathbf H (\boldsymbol \theta_T) ]. Hence, we can invoke a CLT

  • -\frac{1}{n} \mathbf H(\boldsymbol \theta_T) is an average too and, by LLN, converges in probability to its expected value -E[\frac {1}{n} \mathbf H(\boldsymbol \theta_T)]

  • From the limiting distribution of \sqrt n (\hat {\boldsymbol \theta}_{ML} - \boldsymbol \theta_T) we get the asymptotic distribution of \hat{\boldsymbol \theta}_{ML}

\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace ^{-1} \right)

  • This basically implies consistency of \hat {\boldsymbol \theta}_{ML} too. The essential point was E[\mathbf g(\boldsymbol \theta_{T})] = \mathbf 0 in the third-to-last line of the proof

As the expected value in the covariance matrix formula is generally difficult to obtain, the covariance matrix of the estimated parameters is simply estimated by the inverse of the actual (not the expected) negative Hessian matrix at \hat{\boldsymbol \theta}_{ML} (a computational sketch for the Logit model follows at the end of this subsection)

\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ -\mathbf H(\hat {\boldsymbol \theta})^{-1} \ =

\left( -\dfrac {\partial^2 \ln L(\hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( - \dfrac { \sum_{i=1}^n \partial^2 \ln f(y_i | \mathbf x; \hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( -\sum_{i=1}^n \mathbf H_i(\hat {\boldsymbol \theta}) \right)^{-1}

with f(y_i | \mathbf x; \boldsymbol \theta) being the conditional distribution function of observation i

  • Another often used estimator for the covariance matrix involves the outer product of the gradient (score) vectors. This is based on the fact shown above that the Information matrix is defined as the variance of the scores, which is the expected value of the outer product of the gradient vectors

\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ \left( \sum_{i=1}^n \mathbf g_i(\hat{\boldsymbol \theta}) \, \mathbf g_i(\hat{\boldsymbol \theta})' \right)^{-1}

  • This is the so-called BHHH estimator and is comparatively easy to obtain. This estimate is always positive definite, but in small samples the estimator using the Hessian matrix generally performs better

  • Why does the variance of the estimated parameters depend on the second derivatives?

    • The second derivative measures how concave the function is. If this is (in absolute terms) small at the optimum (nearly linear, with a slope of 0), the likelihood function is flat at the optimum. This means that there is a large interval of other parameters having nearly the same likelihood. Hence, the variance of the estimated \theta is high

    • On the other hand, if the second derivative is large in absolute terms, the likelihood function is very curved at the optimum, meaning that there is only a small interval of parameters having nearly the same likelihood. Therefore, the estimate is precise and the variance of \hat \theta is low

  • Hence, we have an inverse relationship between the size of the second derivatives of the likelihood function and the variance of the estimated parameters; and this inverse relationship is reflected in the covariance matrix formula

  • As we already know, the second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum

    • For the Logit model we have: \mathbf H = - \sum_i \Lambda(\mathbf x_i \boldsymbol \beta) (1-\Lambda(\mathbf x_i \boldsymbol \beta) )\mathbf x_i \mathbf x_i', which is clearly negative definite for all \boldsymbol \beta

    • A similar result applies to the Probit model, which means that the likelihood functions of the two models are always concave, guaranteeing a global maximum
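
A computational sketch of these two covariance estimates for the Logit model of Section 10.1.7 (object outlogit), using the analytic Hessian and score formulas for the Logit case given above:

Code
X <- model.matrix(outlogit)
y <- outlogit$y                          # 0/1 dependent variable stored by glm()
b <- coef(outlogit)
lam <- as.vector(plogis(X %*% b))        # Lambda(x_i * beta-hat)

# (1) Hessian-based estimate: -H = sum_i Lambda_i (1 - Lambda_i) x_i x_i'
V_H <- solve(t(X) %*% (X * (lam * (1 - lam))))

# (2) BHHH estimate: outer product of the scores g_i = (y_i - Lambda_i) x_i
G <- X * (y - lam)
V_BHHH <- solve(crossprod(G))

# compare the implied standard errors with those reported by glm()
round(cbind(glm = sqrt(diag(vcov(outlogit))),
            Hessian = sqrt(diag(V_H)),
            BHHH = sqrt(diag(V_BHHH))), 4)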


10.2.3 Hypothesis testing

  • Because we have an estimate of the covariance matrix of \hat {\boldsymbol \theta}, we can carry out all of the usual Wald tests of linear hypotheses (after linearization, nonlinear restrictions are also possible – the restriction matrix \mathbf R is replaced by the Jacobian matrix of the nonlinear restrictions)

    • With this test procedure, only the unrestricted model has to be estimated. If the restriction is not compatible with the data, i.e., if \mathbf R \hat{\boldsymbol \theta} is too far away from the hypothesized value, the restriction is rejected

    • The big advantage of this test strategy is that with a large enough sample no assumption about the distribution is required (we do not need the likelihood function)


  • In the context of MLE, we have an estimate of the log-likelihood of the observed sample. This can be exploited for a Likelihood Ratio (LR) test

    • With this procedure, the restricted as well as the unrestricted model is estimated. We are then looking at the log-ratio of the likelihood of the two models. If the difference of the log-likelihood of the two models is too large, the restricted model can be rejected. The test statistic is asymptotically \chi^2-distributed with m degrees of freedom, m being the number of restrictions

LR \ = \ 2 \left( \ln L(\hat{\boldsymbol \theta}_{ur}) - \ln L(\hat{\boldsymbol \theta}_r) \right) \ \stackrel{a}{\sim} \ \chi^2(m)
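
For instance, a likelihood ratio test of the joint hypothesis that the two children variables can be excluded from the Logit model of Section 10.1.7 can be carried out by hand (a sketch using base R only; m = 2 restrictions):

Code
# restricted model: kidslt6 and kidsge6 excluded
outlogit_r <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age,
                  data = mroz, family = binomial(link = "logit"))

LR <- as.numeric(2 * (logLik(outlogit) - logLik(outlogit_r)))
LR                                       # LR test statistic
pchisq(LR, df = 2, lower.tail = FALSE)   # asymptotic chi^2(2) p-value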


  • The third test concept, which is also derived from the Maximum Likelihood principle, is the Lagrange Multiplier test (LM)

    • With this procedure, only the restricted model is estimated. To obtain the parameter of the restricted model the log-likelihood function is maximized with respect to m constraints \mathbf c(\boldsymbol \theta)=\mathbf 0, using the Lagrange function

    \underset{\boldsymbol \theta}{\max}: \ \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) - \boldsymbol \lambda' \mathbf c(\boldsymbol \theta)

    • The first order condition shows that a restriction is only binding (has an influence on the estimated parameters) if the corresponding element of the (m \, \text{x} \, 1) Lagrange multiplier vector \boldsymbol \lambda is different from zero

    \underbrace {\dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X)}_{\mathbf g(\boldsymbol \theta)} - \boldsymbol \lambda' \dfrac {\partial }{\partial \boldsymbol \theta} \mathbf c(\boldsymbol \theta) = \mathbf 0

    • If the restriction is of the form \theta_j = c_j, with c_j some constants, the corresponding derivative of the constraint is one, \dfrac {\partial }{\partial \theta_j} \mathbf c(\boldsymbol \theta) = 1 and then, from the first order conditions, we have

    \dfrac {\partial }{\partial \theta_j} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) = \lambda_{\tilde j}

    • Thus, the vector \boldsymbol \lambda, which is often interpreted as shadow prices for the restrictions \theta_j = c_j, is equal to the corresponding derivatives of the log-likelihood function. Therefore, \lambda_{\tilde j} measures the extent to which the first order conditions of the unconstrained model are violated (without restriction, the corresponding derivative should differ from \mathbf 0 only because of sampling errors)

    • Hence, one idea is to test whether \boldsymbol \lambda is zero. A similar and more common test is directly based on the gradient vectors \mathbf g(\boldsymbol \theta_r), the score vectors, instead of \boldsymbol \lambda; hence, the label score test. This as well measures to what extent the gradient differs from \mathbf 0, the expected value of \mathbf g(\boldsymbol \theta) in the unconstrained optimum

    • It turns out that \mathbf g(\boldsymbol \theta) is asymptotically normally distributed and we have a “Wald-like” LM test statistic

    LM \ = \ \mathbf g(\boldsymbol \theta_r)' \, \mathbf I(\boldsymbol \theta_r)^{-1} \mathbf g(\boldsymbol \theta_r) \ \ \stackrel{a}{\sim} \ \ \chi^2(m)

    with \mathbf I(\boldsymbol \theta_r) being the Information matrix, evaluated at the restricted parameter vector


  • An asymptotically equivalent procedure to the score test is to first run a restricted regression and then, in a second step, an auxiliary regression using the residuals of the first step to test for a relaxation of the restrictions

    • Typically, the test statistic for this variant of the LM test is \, n \cdot R^2, which is \chi^2(m) distributed (a sketch follows at the end of this subsection)

    • This procedure is especially suited for testing misspecifications, as in this case the restricted variant is typically easier to estimate than the unrestricted model. Examples are tests for

      • Heteroskedasticity (Breusch-Pagan test)

      • Serial correlation (Breusch-Godfrey test)

      • General misspecifications (RESET)

      • Endogenous regressors (Hausman-Wu test)

  • Tests for exclusions of variables or linear restrictions among variables are typically easier to carry out with a Wald test

  • All three test procedures introduced – Wald tests, LR tests and LM tests – are only suitable for nested models and are asymptotically equivalent

    • The small sample properties of LR and LM tests are generally unknown, and the LM test is said to have large small-sample distortions. On the other hand, the LM test is known to have greater power against small deviations from the H_0
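
As an illustration of the auxiliary-regression variant, the Breusch-Pagan test can be computed by hand in its n \cdot R^2 form; a sketch for the linear probability model outm of Section 10.1.1, whose errors are heteroskedastic by construction:

Code
# LM (Breusch-Pagan) test by hand: regress the squared residuals on the regressors
ressq_lpm <- residuals(outm)^2
aux <- lm(ressq_lpm ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

LM <- length(ressq_lpm) * summary(aux)$r.squared   # n * R^2 of the auxiliary regression
LM
pchisq(LM, df = 7, lower.tail = FALSE)             # m = 7 restrictions (all slopes zero)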

Three test procedures

Code
library(magick)
img = image_read("topic3_tests.jpg")
image_scale(img, "600")

Figure 10.3: Principles of the three different test procedures, source Green (2017)

10.2.4 Measures of fit

  • We already dealt with the hit rate in the context of binary choice models, with the refinements of sensitivity and specificity

  • Pseudo R-squared: Compare the maximized log-likelihood of the model with that of a model that only contains a constant (and no explanatory variables); McFadden’s \tilde{R}^2 (a computational sketch follows at the end of this subsection)

\tilde{R}^{2}=1- \dfrac {\ln L_{u r}} {\ln L_{0}}

  • Correlation based measures: Look at correlation (or squared correlation) between predicted values or predicted probabilities versus observed values

    \operatorname {Corr} (y_i, \hat y_i)

    • In the context of continuous dependent variables models, this corresponds to the usual R^2
  • Information criteria:

    • Akaike information criterion, AIC = -2 \ln L + 2k

    • Bayesian (Schwarz) information criterion BIC = -2 \ln L + k \ln n

      • Both are a measure of the information loss due to an imperfect model. They deal with the trade-off between the goodness of fit and the simplicity of the model, thereby considering both the risk of overfitting and underfitting

      • They are based on the log-likelihood of the model but place a penalty on the number of estimated parameters, similar to the adjusted R-squared

      • These criteria can also be used to compare non-nested models (as long as y is the same): the lower, the better – this strategy is said to lead to better out-of-sample predictions
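
A short computational sketch of these measures for the Logit model of Section 10.1.7 (object outlogit):

Code
# McFadden's pseudo R-squared: compare with an intercept-only model
logit0 <- glm(inlf ~ 1, data = mroz, family = binomial(link = "logit"))
1 - as.numeric(logLik(outlogit)) / as.numeric(logLik(logit0))

# squared correlation between observed values and predicted probabilities
cor(mroz$inlf, fitted(outlogit))^2

# information criteria
AIC(outlogit)
BIC(outlogit)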


10.2.5 Digression: Introduction to Bayesian Statistics

For an introduction to Bayesian statistics, watch this video:

https://www.youtube.com/watch?v=Pahyv9i_X2k

As the video above shows, the basic idea of Bayesian statistics is to update prior information about a parameter vector \boldsymbol \theta with observed data. Bayes Theorem implies that the posterior distribution of the parameter vector \boldsymbol \theta, f(\boldsymbol \theta|data), is proportional to the likelihood function f(data|\boldsymbol \theta) times the prior distribution f(\boldsymbol \theta). An important economic application is estimation of DSGE models.

However, the posterior distribution is often very complicated and difficult or impossible to integrate. For instance, we need to integrate to get the expectation of the posterior E(\boldsymbol \theta|data) = \int_\theta \boldsymbol \theta f(\boldsymbol \theta|data) d\boldsymbol \theta.

A solution to this problem is provided by sampling algorithms, which draw samples of \boldsymbol \theta from a proposal distribution in such a way that we finally get draws that are asymptotically distributed like the posterior distribution. If we have such a large sample of \boldsymbol \theta from the posterior, it is very easy to get the expectation of \boldsymbol \theta (it is simply the mean) or credible intervals (the Bayesian counterparts to confidence intervals).

Markov Chain Monte Carlo (MCMC) methods, and in particular the Metropolis-Hastings (MH) algorithm, are such powerful sampling algorithms. By the way, these algorithms are also very important in many other fields like climate modeling, physics or big data econometrics. To learn more, watch the following video:

https://www.youtube.com/watch?v=U561HGMWjcw
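
For a flavor of how such a sampler works, here is a minimal random-walk Metropolis-Hastings sketch for a toy problem (estimating the mean of normally distributed data with known variance and a diffuse normal prior; all values are chosen purely for illustration):

Code
set.seed(1)
ydat <- rnorm(100, mean = 2, sd = 1)                   # toy data

# log posterior (up to a constant): N(theta, 1) likelihood times N(0, 10^2) prior
logpost <- function(theta) {
  sum(dnorm(ydat, mean = theta, sd = 1, log = TRUE)) + dnorm(theta, 0, 10, log = TRUE)
}

n_draws <- 5000
draws <- numeric(n_draws)
theta <- 0                                             # starting value
for (s in 1:n_draws) {
  prop <- theta + rnorm(1, sd = 0.5)                   # random walk proposal (symmetric)
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < logpost(prop) - logpost(theta)) theta <- prop
  draws[s] <- theta
}

keep <- draws[-(1:1000)]                               # discard burn-in draws
mean(keep)                                             # posterior mean, close to mean(ydat)
quantile(keep, c(0.025, 0.975))                        # 95% credible interval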


10.3 Count data

Count data arise when the dependent variable is a counting number per unit of time or space – e.g., the number of restaurant meals taken by a consumer during a week or the number of arrests of a person during a year. So, y_i \in \lbrace 0, 1, 2, \ldots \rbrace

  • If these numbers are large and widespread, the variable can be considered as continuous and hence, OLS would be appropriate

  • If these numbers are small, quite rare events, then the continuous approximation is a poor one

    • In this case, OLS can still yield consistent estimates and the predicted values are interpreted as conditional expected values of the dependent variable

    • However, OLS does not account for the special structure of the data, leading to inefficiency and nonsense predictions, like -0.2 restaurant meals per week

  • One remedy to avoid negative predictions is to postulate

E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

  • and apply NLS. However, this would not be efficient as the error term is heteroscedastic

  • Therefore, maximum likelihood estimation is usually employed for this type of models. As count data are often Poisson distributed this leads to the Poisson regression model (or to the more general negative binomial regression model)


  • The Poisson distribution is derived from a limit of the binomial distribution, which we already encountered with the Logit and Probit models. If the number of independent (Bernoulli) draws goes to infinity and the success probability p goes to zero, so that n\cdot p \rightarrow \lambda, the binomial distribution converges to

    P(Y = y_i) = \dfrac {\lambda^{y_i}}{y_i!} e^{-\lambda}

    And this is called Poisson distribution, with y_i the number of counts

  • The parameter \lambda is equal to both the expectation and the variance of counts! This distribution is considerably right-skewed for a small \lambda, but for \lambda>30, it’s almost identical to a N(\lambda, \lambda) normal distribution

  • As we are interested in the effects of the explanatory variables on y (the number of counts), we must look at the Poisson distribution conditional on \mathbf x. We only have to specify \lambda = E[y \, | \, \mathbf x], which we set to

    \lambda = E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

    as above – \lambda has to be positive!


Given a sample of n observations, we can construct the log-likelihood function, which is the joint log-probability that each y_i is observed

\ln L(\boldsymbol \beta) \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \ln \lambda_i - \lambda_i - \ln(y_i!) \right] \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \, {\mathbf x_i \boldsymbol \beta} - e^{\mathbf x_i \boldsymbol \beta} - \ln(y_i!) \right]

Note that the last term, \ln(y_i!), is independent of \boldsymbol \beta and could be omitted
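
The formula can be checked against the log-likelihood reported by glm(); a small sketch with simulated count data (values chosen purely for illustration):

Code
set.seed(7)
x <- rnorm(200)
ycount <- rpois(200, lambda = exp(0.5 + 0.8 * x))      # toy Poisson data

pois_fit <- glm(ycount ~ x, family = poisson)
b  <- coef(pois_fit)
xb <- cbind(1, x) %*% b

sum(ycount * xb - exp(xb) - lfactorial(ycount))        # hand-coded log-likelihood
logLik(pois_fit)                                       # should give the same value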

  • The estimates \hat{\boldsymbol \beta} are obtained by numerical maximization of the log-likelihood function

  • As we have

    E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

    the partial effects are

    \dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} = {\underbrace {e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}_{E[y \, | \, \mathbf x]}} \, \beta_j \ \ \Rightarrow \tag{10.2}

    \dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} \dfrac {1}{E[y \, | \, \mathbf x]} \ = \ \dfrac {\partial \ln E[y \, | \, \mathbf x]}{\partial x_j} \ = \ \beta_j \tag{10.3}

    This is a semi-elasticity: a change in x_j of 0.01 leads to (approximately) a \beta_j percent change in E[y \, | \, \mathbf x]. For larger variations in x_j (e.g., for a dummy variable) use the exact formula (e^{\beta_j \Delta x_j}-1) \cdot 100 for the percentage change


  • The Poisson regression model assumes that \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) which is very restrictive and often violated in applied work

  • If \operatorname{Var}(y|\mathbf x) > E(y|\mathbf x), we have overdispersion (which is quite common), in the reverse case underdispersion (which is not so common)

  • Fortunately, MLE of the Poisson regression model are consistent and asymptotically normal even if the Poisson distribution does not hold. This is then called Quasi-Maximum Likelihood estimation (QML). (This is similar to the case of OLS when the disturbance is not normally distributed)

  • If the distributional assumption does not hold and QML is used, standard errors are wrong. One then has to compute adjusted standard errors

    • The residuals of the Poisson model are as usual \hat u_i = y_i-\hat y_i. To judge whether there is overdispersion the residuals are scaled (divided) by \sqrt{\hat y_i}. This is a proper heteroscedasticity correction. Additionally, \hat y is a predictor for E(y) and therefore also for \operatorname{Var}(y|\mathbf x). Hence, the variance of these scaled residuals should be one if the condition \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) is true

    • If the calculated standard deviation of these scaled residuals, \hat \sigma, is larger than one, we have overdispersion. In this case the estimated standard errors of the coefficients should be multiplied by \hat \sigma (a small sketch of this correction follows below)
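Continuing the simulated sketch from above (the objects x and y and the true Poisson model), the overdispersion check and the corresponding standard-error correction could look like this:

Code
# Overdispersion check for a Poisson fit (sketch; sigma_hat should be close to 1 here,
# because the simulated data are truly Poisson distributed)
pois_fit  <- glm(y ~ x, family = poisson)
u_scaled  <- (y - fitted(pois_fit)) / sqrt(fitted(pois_fit))   # scaled residuals
sigma_hat <- sd(u_scaled)
sigma_hat * sqrt(diag(vcov(pois_fit)))                         # corrected standard errors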


10.3.1 Example

  • As an example we analyze the crime1 data set of the Wooldridge Package. The dependent variable narr86 is the number of times a man is arrested during 1986. This variable is zero for 1970 of the 2725 men in the sample and only eight values of narr86 are larger than five. So, the Poisson regression model is more appropriate than OLS

  • Some interesting explanatory variables are

    • pcnv: proportion of prior convictions
    • tottime: time in prison since age 18, in months
    • ptime86: months in prison during 1986
    • inc86: legal income 1986, in $100s
    • black: =1 if black
    • hispan: =1 if Hispanic
  • We estimate the model with OLS as well as with the Poisson and the Quasi-Poisson regression model; the latter two only differ by the correction of the standard errors due to overdispersion

  • Additionally, we calculate the APEs, which are comparable to the OLS estimates


  • The OLS estimate for pcnv is -0.132, which means that with \Deltapcnv = 0.1, the expected number of arrests during 1986 decreases by 0.013

  • The corresponding coefficient of the Poisson model is -0.402. But this is now a semi-elasticity: with \Deltapcnv = 0.1, the expected number of arrests during 1986 decreases by 4.02%. Calculating the APE (which averages Equation 10.2) with the package marginaleffects, we get the absolute effect of -0.162, which is now comparable to the OLS estimate

  • The effect of black in the OLS model is 0.327, meaning that for black men the expected number of arrests is 0.327 higher than for white men

  • The Poisson coefficient for black is 0.661. In this case (a large \Delta x_j), we should use the precise formula, e^{\beta_j \Delta x_j}-1, leading to exp(0.661)-1 \approx 0.94, i.e., the expected number of arrests is about 94% higher than for white men. Calculating the APE (which averages Equation 10.2), we get the effect in absolute terms, 0.328, which is about the same as the estimated OLS effect

  • The SE of the transformed residuals of the Poisson model is 1.232, showing a moderate overdispersion. The column QPoisson shows the corrected standard errors for the estimated \beta_j

  • The estimated R^2, which is the squared correlation of actual versus predicted values of narr86, shows that the overall fit of the Poisson regression model is better than that of the linear regression model, although the OLS coefficients are chosen to maximize R^2. But OLS is restricted to a linear relationship


library(wooldridge)
library(marginaleffects)   # for avg_slopes()
library(modelsummary)      # for the results table below
data("crime1")

# OLS
outlm <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
               inc86 + black + hispan + born60 , 
             data = crime1)


# Poisson - family = poisson and the default link is "log", 
# so the inverse default link is "exp"
outpoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
                    inc86 + black + hispan + born60 , 
                  data = crime1, family = poisson)


# Quasi-Poisson - this only adjusts the standard errors 
outquasipoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
                         inc86 + black + hispan + born60 , 
                       data = crime1, family = quasipoisson)


# Calculating APE with "marginaleffects"
marg_lm <- avg_slopes(outlm)
marg_pois <- avg_slopes(outpoisson)
marg_qpois <- avg_slopes(outquasipoisson)

Code
# Reset the custom glance function for glm() (defined earlier) so that no extra
# goodness-of-fit statistics are added by modelsummary
glance_custom.glm <- function(x, ...) {
  
}

modelsummary( list("OLS"=outlm, "Poisson"=outpoisson, "Poisson APE"=marg_pois,
                   "QPoisson"=outquasipoisson, "QPoisson APE"=marg_qpois), 
              output="flextable", 
              gof_omit = "F", 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.2:

Number of arrests, depending on several variables, several count models

                        OLS                 Poisson             Poisson APE         QPoisson            QPoisson APE
(Intercept)              0.577*** (0.038)   -0.600*** (0.067)                       -0.600*** (0.083)
pcnv                    -0.132**  (0.040)   -0.402*** (0.085)   -0.162*** (0.035)   -0.402*** (0.105)   -0.162*** (0.043)
avgsen                  -0.011    (0.012)   -0.024    (0.020)   -0.010    (0.008)   -0.024    (0.025)   -0.010    (0.010)
tottime                  0.012    (0.009)    0.024+   (0.015)    0.010+   (0.006)    0.024    (0.018)    0.010    (0.007)
ptime86                 -0.041*** (0.009)   -0.099*** (0.021)   -0.040*** (0.008)   -0.099*** (0.025)   -0.040*** (0.010)
qemp86                  -0.051*** (0.014)   -0.038    (0.029)   -0.015    (0.012)   -0.038    (0.036)   -0.015    (0.014)
inc86                   -0.001*** (0.000)   -0.008*** (0.001)   -0.003*** (0.000)   -0.008*** (0.001)   -0.003*** (0.001)
black                    0.327*** (0.045)    0.661*** (0.074)    0.328*** (0.045)    0.661*** (0.091)    0.328*** (0.055)
hispan                   0.194*** (0.040)    0.500*** (0.074)    0.235*** (0.040)    0.500*** (0.091)    0.235*** (0.050)
born60                  -0.022    (0.033)   -0.051    (0.064)   -0.020    (0.025)   -0.051    (0.079)   -0.020    (0.031)
Num.Obs.              2725                2725                2725                2725                2725
R2                       0.072
AIC                   6721.4              4517.5              4517.5
BIC                   6786.4              4576.6              4576.6
Log.Lik.             -3349.678           -2248.761           -2248.761
RMSE                     0.83                0.83                0.83                0.83                0.83

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4 Censored and truncated data

  • In many economic contexts, decision problems are such that either a positive amount or a zero amount is chosen, e.g., the demand for cars or alcohol (corner solutions). In this case, we have a form of censored data and the distribution of car expenditures is a combination of a discrete distribution (at 0) and a continuous distribution

  • Another form of censored data arises when the dependent variable is censored from above or below a certain threshold, typically due to survey design; e.g., if survey respondents may answer a question about their wealth only with "more than 1 million". Typically, there is no underlying optimization problem in this case

  • In the first example for censored data we have expenditure data for buyers and non-buyers (which are zero). However, if the data are collected from sales tax records, then the data would only include buyers

  • Data in which observations are unavailable above or below a certain threshold are called truncated data. Note, for truncated data we observe neither y_i nor \mathbf x_i, in contrast to censored data. The sample is not random any more if the selection mechanism is related to the value of the dependent variable

    • Remark: Selection mechanisms which are exogenous, i.e., the dependent variable and its error term are not involved, pose no special estimation problems

      • Example:

        • Suppose we want to estimate the relationship between the saving rate and income, but we have oversampled people with high income (exogenous sampling mechanism, which is based on an exogenous (rhs) variable). If the same relationship between savings and income is true for all people, this would not lead to biased estimates, (although with higher variance – less variation in the explanatory variable)

        • However, if we have oversampled people with a high saving rate (endogenous sampling mechanism) this would lead to biased estimates even if the same relationship between savings and income is true for all people (sample selection problem)

  • A linear regression model may be inadequate for both censored and truncated data

  • Figure 10.4 below shows the effects of censored and truncated data for estimating the true (population) relationship. In both cases we see a significant bias towards zero in the estimates based on the observed data. Note, regarding the truncated sample, we have an endogenous sampling mechanism related to the value of y – thus there is a sample selection problem


Code
library("ggplot2")
library("dplyr")
library("tidyverse")

baw <- FALSE
set.seed(2020)
N <- 1000
X = rnorm(N, 3, 4)
Y = - 1 + 0.5 * X + rnorm(N, 0, 1.8)
z <- data.frame(X = X, Y = Y)

z <- mutate(z,
            neg = factor(Y < 0, labels = c("no", "yes")),  # is Y negative?
            Yc = ifelse(Y > 0 , Y, 0),
            Yt = ifelse(Y > 0, Y, NA))
z2 <- gather(z, key, value, -X, - neg)

colols <- ifelse(baw, "black", "#AA4444")

ys <- c("Y" = "Whole sample", "Yc" = "Censored sample", "Yt" = "Truncated sample")

gp <- ggplot(z2, aes(X, value)) + geom_point(size = 0.6, aes(colour = neg)) + 
    geom_smooth(method = "lm", se = FALSE,  lty = "twodash", color = colols) +
    geom_abline(intercept = -1, slope = 0.5) +
    facet_wrap(~ key, ncol = 3, labeller = labeller(key = ys)) +
    guides(colour = "none")  + xlab("") + ylab("") + 
    theme(legend.text = element_text(size = 6), 
          legend.title= element_text(size = 8),
          axis.title = element_text(size = 8))

if (baw) gp <- gp + scale_color_grey()
gp 

Figure 10.4: Simulated data, censored or truncated, and the corresponding regression lines compared to the true relationship

Some examples for typical count and censored data

library(wooldridge);  
data("crime1")

# Number of arrest - typical count data
hist(crime1$narr86, main = "Typical count data, should look like a Poisson distribution")


library(wooldridge);  
data("mroz")

# Number of hours worked - not everybody works 
# Data apparently not Poisson distributed
hist(mroz$hours, main = "Typical censored data, (>= 0)")


library(wooldridge);  
data("recid") 

# Duration time until next arrest. We have different release times from prison, 
# hence different thresholds as the endpoint for observations is fixed
hist(recid$durat, main = "Censored data from above with different but known thresholds")


10.4.1 Truncated Normal Regression Model

Let’s start with the analysis of truncated regression models. Characteristic for these is that the outcome y_i and the explanatory variables \mathbf x_i are only observed if the outcome y_i is larger (or smaller) than some threshold c_i

  • In this case, the sample is not a random sample from the population (because some units will never be a part of the sample)

  • The truncated normal regression model is:

y_i = \mathbf x_i \boldsymbol \beta + u_i

u_i \, | \, \mathbf x_i \sim N(0, \sigma^2)

(y_i, \mathbf x_i) \ \text{ only observed if } \ y_i > c_i \ \ \ (\text{or } \ y_i < c_i) \tag{10.4}

  • Truncation from below or left: observe y_i and \mathbf x_i if y_i > c_i

  • Truncation from above or right: observe y_i and \mathbf x_i if y_i < c_i

  • In the following we examine truncation from below. Truncation from above is handled analogously and does not require any new results, Green (2017)


Truncated normal distribution

Theorem 10.1 (Truncated normal distribution, see Green (2017)) Let y be a normally distributed random variable with density f(y), mean E(y)=\mu and variance \operatorname {Var}(y)=\sigma^2 and a cutoff constant c. Then the conditional density function f(y|y>c) is

f(y) \ \equiv \ f(y|y>c)\cdot P(y>c) \ \ \Rightarrow

f(y|y>c) = \frac{f(y)}{P(y>c)} = \frac{f(y)}{P(y-\mu>c-\mu)} =

\frac{f(y)}{P[(y-\mu)/\sigma > (c-\mu)/\sigma]} \, = \, \frac{f(y)}{1-F[(c-\mu)/\sigma]} \, =

\frac{\frac{1}{\sigma}\phi \left( \dfrac{y-\mu}{\sigma}\right)}{1-\Phi\left(\dfrac{c-\mu}{\sigma}\right)} \ \tag{10.5}

As already defined above, \phi(z) denotes the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution (see Equation A.3 and Equation A.4).

The mean and variance are given by:

\begin{align*}E(y|y>c) &= \mu + \sigma \, \lambda \left(\frac{\mu-c}{\sigma} \right) \\ \operatorname{Var}(y|y>c) &= \sigma^2 \left[ 1-\delta \left(\frac{\mu-c}{\sigma}\right) \right] \end{align*} \tag{10.6}

where \lambda(x) is the Inverse Mills Ratio (or hazard function of the standard normal distribution): 1

\lambda(x) := \dfrac {\phi(x)}{\Phi(x)}, \ \ \lambda(x) >0, \ \lambda'(x) <0, \ \forall x \tag{10.7}

\delta(x) := \lambda(x)[\lambda(x)-x], \ \ 0<\delta(x)<1

Remarks: E(y|y>c) > E(y) and \operatorname {Var}(y|y>c) < \operatorname {Var}(y).

The Inverse Mills Ratio \lambda(x) (IMR) is fundamental in models with incomplete data or self-selective samples. It can be interpreted as a measure of the probability that a particular observation is selected (is in the sample) compared to those that are not selected (not in the sample). A lower IMR indicates a higher probability of selection.
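As a quick numerical check of Theorem 10.1, the following sketch (with assumed values \mu=0, \sigma=1 and c=0.5) compares the simulated mean of a truncated normal sample with the formula in Equation 10.6:

Code
# Inverse Mills ratio and a simulation check of E(y | y > c)
lambda <- function(x) dnorm(x) / pnorm(x)

mu <- 0; sigma <- 1; c0 <- 0.5          # assumed values for illustration
set.seed(1)
y <- rnorm(1e6, mu, sigma)

mean(y[y > c0])                         # simulated truncated mean
mu + sigma * lambda((mu - c0) / sigma)  # Equation 10.6 - should be nearly identical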


Code
x <- seq(-3, 3, length=100)

# plotting normal distribution
plot(x, dnorm(x), type="l", ylim=c(0, 1.125)) 

# plotting truncated normal distribution, truncated from below at -0.5
a=-0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0,  x1=a, y1=o, lty = 3) 

# plotting truncated normal distribution, truncated from below at 0.5
a=0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0,  x1=a, y1=o, lty = 3) 

Figure 10.5: Normal and truncated normal distributions for c=-0.5 and c=0.5. Note that the area below each density always has to be one

Estimation

Applying OLS would not yield correct results because MLR.2 is violated (see Section 2.7). To see this note that, according to the theorem above, the expectation of y_i for the observed subpopulation y_i > c_i is:

E(y_i|y>c_i) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \tag{10.8}

  • Hence, regressing y_i only on \mathbf x_i leads to an omitted variable bias as the omitted variable \lambda(\mathbf x_i) (the inverse Mills Ratio, Equation 10.7) is a nonlinear function of \mathbf x_i and therefore probably correlated with \mathbf x_i

    • In particular, for observations with a low a priori probability to be in the sample (because \mathbf x_i \boldsymbol \beta is low), the variable \lambda(\mathbf x_i) is high, thereby correcting for the selection bias in Equation 10.8

The partial effect of x_j on E(y_i|y>c_i), which we want to estimate, is:

\dfrac {\partial E(y_i|y>c_i)}{\partial x_j} \, = \, \beta_j + \sigma \, \lambda' \, \dfrac {\partial \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) }{\partial x_j} \, = \, \beta_j(1 + \underset {(-)} {\lambda'}) \tag{10.9}

  • This can be shown to be less than \beta_j. Hence, we have a bias towards zero (compare Figure 10.4) 2

As just shown, OLS is biased and inconsistent for truncated data because of an omitted variable problem.

Thus, Maximum Likelihood estimation seems appropriate.

  • According to Theorem 10.1, the density of an observed outcome y_i, conditional on the explanatory variables and on y_i > c_{i}, is:

f\left(y_{i} \, | \, \mathbf{x}_{i}, y_i > c_{i}\right) \ = \ \frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \tag{10.10}

Maximizing the log-Likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{i=1}^n \left\{ \ln \left[ \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \right] - \ln \left[ 1 - {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \right\}

\rightarrow \ \ \ \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

  • Note that non-normality or heteroscedasticity in u_i leads to inconsistent estimates

  • Further note, the untruncated data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)


Example

Code
# We simulate data and store them in the data.frame d
set.seed(1234890)
n <- 10000

# true parameters, which we want to estimate 
sigma <- 4
beta_0 <- 2
beta_1 <- 1;

x <- rnorm(n, mean = 0, sd = 2); 
eps <- rnorm(n, sd = sigma)

# the true model 
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package

# generating truncated data y = yt, at c <= 1
d$yt <- ifelse(d$y > 1, d$y, NA)

# compare estimates for full / truncated data
# we use the function `truncreg` from the package `truncreg`
full  <- lm(y ~ x, data = d)
ols   <- lm(yt ~ x, data = d)
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")

summary(trunc)
      
      Call:
      truncreg(formula = yt ~ x, data = d, point = 1, direction = "left")
      
      BFGS maximization method
      45 iterations, 0h:0m:0s 
      g'(-H)^-1g = 7.38E-09 
       
      
      
      Coefficients :
                  Estimate Std. Error t-value  Pr(>|t|)    
      (Intercept) 2.119760   0.170904  12.403 < 2.2e-16 ***
      x           0.994784   0.046537  21.376 < 2.2e-16 ***
      sigma       3.972827   0.079421  50.022 < 2.2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Log-Likelihood: -13547 on 3 Df

Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc), 
              output="flextable", 
              gof_map = c("nobs", "aic", "bic", "rmse"), 
              align = "lddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.3:

Simulated truncated data, estimated with OLS and MLE. True values are b0=2, b1=1 and sigma=4

                  OLS full data        OLS trunc data       MLE trunc data
(Intercept)           2.081*** (0.040)     4.707*** (0.037)     2.120*** (0.171)
x                     1.035*** (0.020)     0.474*** (0.019)     0.995*** (0.047)
sigma                                                           3.973*** (0.079)
Num.Obs.          10000                 5967                 5967
AIC               56035.8              29009.9              27099.7
BIC               56057.4              29030.0              27119.8
RMSE                  3.99                 2.75                 3.69

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4.2 Censored Normal Regression Model

  • Generally, censored data do not result from a truncation problem. In contrast to truncated data, the explanatory variables \mathbf x_i are observed even if the outcome y_i is below (or above) some threshold c_i

  • Typically, a dependent variable is censored from above or below a certain threshold due to survey design; e.g., values are only reported up to a certain level (e.g., top coded wealth)

    • Another example for censoring from below is limited measurement accuracy, for instance tumor marker values or antibody counts in medical data
  • With y_i^* being the true outcome and y_i being the observed outcome, the censored normal regression model is

y_i^* = \mathbf x_i \boldsymbol \beta + u_i

u_i \, | \, \mathbf x_i \sim N(0, \, \sigma^2)

y_i = \begin{cases} \max(y_i^*,c_i) \ \ \ \text{:left censoring (from below)} \\ \min(y_i^*,c_i) \ \ \ \ \text{:right censoring (from above)} \\ \max(y_i^*,\underline {c}_i) \ \land \ \min(y_i^*,\bar c_i) \ \ \ \text{:censoring from above and below} \end{cases} \tag{10.11}

  • In the following we examine censoring from below. Censoring from above is handled analogously

Distribution of censored data

Figure 10.6: Censored normal distribution at c=0, a mixture of a discrete and a continuous distribution. In this example, half of the probability mass is concentrated in one point

For censored data, the distribution function of y_i is a mixture of a discrete distribution, concentrated at the threshold c_i, and a truncated continuous density; the normal density in our case

  • Such a distribution is shown in Figure 10.6 above with threshold c_i=0

    • Note, in this case no renormalization of the truncated continuous density is necessary, because the area below the truncated density function plus the probability of the discrete mass point is one
  • We have two parts of the distribution of y_i (all conditional on \mathbf x_i, which we leave out to save notation):

    1. For (left) censored observations, y_i = c_i, we have a discrete mass point with probability:

    P(y_i = c_i) \,=\, P(y_i^* \leq c_i) \,=\, P(u_i \leq c_i - \mathbf x_i \boldsymbol \beta) \,=\, P \left( \frac {u_i}{\sigma} \leq \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \,=

    \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.12}

    2. For the uncensored observations, y_i = y_i^* > c_i, the density of y_i, conditional on y_i>c_i, corresponds to a truncated density f\left(y_{i} \, | \, y_i > c_{i}\right), already derived in the section on truncated data, see Equation 10.5 and Equation 10.10
  • For the whole two-part distribution of y_i we have

f(y_i \,|\, \mathbf x_i, c_i) \ = \begin{cases} P(y_i=c_i) = \Phi \left( \dfrac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right); \ \ \text{ if } y_i=c_i \\ f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i); \ \ \text{ if } y_i>c_i \end{cases} \tag{10.13}

  • The second part simplifies to

f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i) \ = \ \underbrace {\frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)}}_{ f(y_{i} \, | \, y_i > c_{i})} \underbrace {\left [1 - \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right ]}_{P(y_i>c_i)} \ =

\underbrace {\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)}_{f(y_i^*)=N(\mathbf x_i \boldsymbol \beta, \, \sigma^2)} \tag{10.14}

  • And this is simply the non-truncated normal distribution of the uncensored observations (as mentioned above, there is no need for a renormalization)

Estimation

This leads to the joint density of the observed vector \mathbf y:

f(\mathbf y \, | \, \mathbf X,c_i) \ = \ \prod_{y_i=c_i} \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \cdot \prod_{y_i>c_i} \frac{1}{\sigma} \phi \left( \frac {y_i-\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.15}

  • Maximizing the log-likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{y_i=c_i} \ln \left[ {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \ + \ \sum_{y_i>c_i} \ln \left[ \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \right] \ \ \rightarrow \ \ \hat {\boldsymbol \beta}, \hat \sigma^2

  • Estimation has to be done by numerical optimization methods

  • After estimation, parameter tests and tests of exclusion restrictions can be carried out as usual with Wald, Likelihood Ratio or Lagrange Multiplier tests

  • Note, if the distributional assumptions on u_i (normality and homoscedasticity) are violated, the MLE above is inconsistent

  • Further note, the uncensored data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)


OLS applied to the observed data doesn’t yield consistent estimates for \boldsymbol \beta as the expectation of y_i, conditional on \mathbf x_i and y_i\geq c_i, is not \mathbf x_i \boldsymbol \beta but a highly nonlinear function in \mathbf x_i and \sigma

E(y_i) \ = \ E(y_i|y \geq c_i) \ = \ E(y_i|y_i > c_i) \cdot P(y_i > c_i) \, + \, \underbrace {E(y_i|y_i = c_i)}_{c_i} \cdot P(y_i = c_i)

Plugging in our known results regarding the truncation case E(y_i|y_i > c_i) (Equation 10.8) and for P(y_i = c_i) (Equation 10.12), respectively P(y_i > c_i), we have

E(y_i) \ = \ \underbrace {\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right]}_{E(y_i|y_i > c_i)} \underbrace {\left[ 1 -\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right]}_{P(y_i>c_i) \, = \, [1-P(y_i=c_i)]} \, + \, c_i \, \underbrace {\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right)}_{P(y_i=c_i)} \ =

\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right] \underbrace {\left[ \Phi \left( \frac {\mathbf x_i \boldsymbol \beta - c_i}{\sigma} \right) \right]}_{P(y_i>c_i)} \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \ =

\mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.16}
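A quick simulation check of Equation 10.16 (a sketch with assumed values for the linear index, \sigma and the threshold):

Code
# Check E(y) for left-censored data against Equation 10.16
set.seed(1)
xb <- 0.5; sig <- 2; c0 <- 0                 # assumed x_i*beta, sigma and threshold
ystar <- xb + rnorm(1e6, sd = sig)           # latent outcome
y     <- pmax(ystar, c0)                     # censoring from below at c0

mean(y)                                                                        # simulated E(y)
xb*pnorm((xb - c0)/sig) + sig*dnorm((xb - c0)/sig) + c0*pnorm((c0 - xb)/sig)   # Equation 10.16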


Example continued

Code
# simulated data - example from above continued
set.seed(1234890)
n <- 1500  # fewer observations to better judge efficiency 

# true parameters, which we want to estimate 
sigma <- 4
beta_0 <- 2
beta_1 <- 1;

x <- rnorm(n, mean = 0, sd = 2); 
eps <- rnorm(n, sd = sigma)

# the true model 
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package
library(censReg) # Censored Regression package

## truncated y = yt, censored y = yc, at c = 1
d$yt <- ifelse(d$y > 1, d$y, NA)
d$yc <- ifelse(d$y > 1, d$y, 1)

## compare estimates for full/truncated/censored data
# ols full data 
full <- lm(y ~ x, data = d)

# ols truncated data
ols <- lm(yt ~ x, data = d) 

# ols censored data
olsc <- lm(yc ~ x, data = d)

# MLE truncated data
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")

# MLE censored data
censored <- censReg(yc ~ x, data = d, left = 1) 

summary(censored)
      
      Call:
      censReg(formula = yc ~ x, left = 1, data = d)
      
      Observations:
               Total  Left-censored     Uncensored Right-censored 
                1500            594            906              0 
      
      Coefficients:
                  Estimate Std. error t value Pr(> t)    
      (Intercept)  2.08976    0.11893   17.57  <2e-16 ***
      x            0.90929    0.05969   15.23  <2e-16 ***
      logSigma     1.38363    0.02540   54.47  <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Newton-Raphson maximisation, 5 iterations
      Return code 1: gradient close to zero (gradtol)
      Log-likelihood: -2975.228 on 3 Df

Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc,
                   "OLS cens data"=olsc, "MLE cens data"=censored), 
              output="flextable", 
              gof_map = c("nobs", "aic", "bic", "rmse"), 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.4:

Simulated truncated and censored data, estimated with OLS and MLE. True values are b0=2, b1=1 and sigma=4. Note, we now have only 1500 observations, compared to 10000 in the previous example. This shows that censored regression is more efficient than truncated regression

                  OLS full data       OLS trunc data      MLE trunc data      OLS cens data       MLE cens data
(Intercept)           2.039*** (0.104)    4.716*** (0.096)    2.305*** (0.419)    3.352*** (0.071)    2.090*** (0.119)
x                     0.964*** (0.053)    0.397*** (0.050)    0.827*** (0.116)    0.526*** (0.036)    0.909*** (0.060)
sigma                                                         3.920*** (0.201)
logSigma                                                                                              1.384*** (0.025)
Num.Obs.          1500                 906                 906                1500                1500
AIC               8439.7              4391.6              4117.3             7274.4              5956.5
BIC               8455.6              4406.0              4131.8             7290.3              5972.4
RMSE                  4.02                2.72                3.55                2.73

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4.3 Tobit model

  • Strictly speaking, the Tobit model, Tobin (1958), is formally identical to a censored normal regression model. However, there are two distinct features, one minor and one important

    • In most cases, the Tobit model is formulated with left-censoring and c_i=0. This is not important and can easily be changed

    • More importantly, the Tobit model is based on an optimization model with corner solutions and not on censoring of data above or below a certain threshold due to certain survey designs

    • This different starting point has serious consequences for interpreting the estimation results, especially for calculating the relevant partial effects

  • Let us consider an example for an underlying optimization problem with corner solutions for some agents i who maximize their utility U(y_i, z_i) subject to a budget constraint p_yy_i+p_zz_i \leq I_i

  • y_i and z_i could be leisure or working time versus consumption of goods, or consumption of tobacco versus other goods, or consumption today versus consumption tomorrow, etc. The critical point is the additional non-negativity constraints; it is not possible to choose negative values, hence, y_i≥0 and z_i≥0


Without the non-negativity constraints the solution of this optimization problem for y_i is (assuming a linear solution which is to be expected with certain utility functions)

y_i^* = f(p_y, p_z, I_i, u_i) = \mathbf x_i \boldsymbol \beta + u_i

  • Here, y_i^* represents the potential choice, without non-negativity constraints, and u_i represents unobserved heterogeneity of the individual utility function. However, invoking the non-negativity constraints, we arrive at a standard censored regression model, with y_i being the observed values

y_i^* = \mathbf x_i \boldsymbol \beta + u_i \\ y_i = y_i^* \quad \text{if } \ y_i^* > 0 \\ y_i = 0 \quad \ \text{if } \ y_i^* ≤ 0

  • The difference to the standard censored regression model is that in this case, the non-negativity arises because the agent simply cannot choose negative values of a good, and not because the data are censored for some reason

This has consequences for the interpretation of the model. In the standard censored regression model we are interested in the population parameter \boldsymbol \beta, which, in the case of censored or truncated data, cannot be consistently estimated by OLS. Therefore, we need a maximum likelihood approach

In the Tobit model, with an underlying optimization problem with non-negativity constraints, we are typically interested in how much the value of the actual observed variable y (labor supply in hours or expenditures on cars or tobacco) changes due to changes of the explanatory variables.

Partial effects

In particular, there are actually three different partial effects present in the model

  1. The effect on the expectation of the latent variable – therefore on potential choices (without non-negativity restriction) and not on actual choices. Typically, this is not what we want in this case

\dfrac {\partial E(y^*|\mathbf x)}{\partial x_j} = \beta_j

  2. The effect on expected actual choices y, but only for uncensored observations, e.g., the choice of already employed people about their hours worked if the wage changes, or the choice about expenditures on tobacco for people who actually are smokers. Sometimes, this is what we want

\dfrac {\partial E(y|\mathbf x, y>0)}{\partial x_j} = \ ?

  3. This is the effect on expected actual choices y, but for all actual values, independently of whether they are censored or not. Most of the time, this is what we want

\dfrac {\partial E(y|\mathbf x, y≥0)}{\partial x_j} = \dfrac {\partial E(y|\mathbf x)}{\partial x_j}= \ ?


Hence, typically, it is the last partial effect we are interested in. For instance, if the wage or the number of children is changing, we want to know the effect on the supplied hours worked coming from people already employed as well as from people who are not currently working.

  • Therefore, the last partial effect also incorporates the effect on the probability for doing something or not (e.g., working, smoking, driving with a car, etc.)

\dfrac {\partial P(y_i>0|\mathbf x_i)}{\partial x_j} \, = \, \dfrac {\partial \Phi(\mathbf x_i \boldsymbol \beta/\sigma)}{\partial x_j} \, = \, \dfrac {\beta_j}{\sigma} \phi(\mathbf x_i \boldsymbol \beta/\sigma) \tag{10.17}

We already calculated the expectation E(y_i| \mathbf x_i, y_i≥0), (Equation 10.16). But here, we are dealing with the case of c_i=0, so the third term in Equation 10.16 is zero

E(y_i| \mathbf x_i, y_i≥0) \ = \ E(y_i| \mathbf x_i) \ = \ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \ =

\Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \left[ \mathbf x_i \boldsymbol \beta \, + \, \sigma \, \lambda \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right]

  • The surprisingly simple partial effect is

\dfrac {\partial E(y_i|\mathbf x_i, y_i≥0)}{\partial x_j} \ = \ \beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.18}

  • Hence, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor \Phi(\mathbf x_i \boldsymbol \beta / \sigma), which is clearly positive and less than 1

    • For calculating the APE, we use the average over all observations of this adjustment factor, i.e.,

    \widehat{A P E}_{j} = \hat\beta_{j} \, \frac{1}{n} \sum_{i=1}^{n} \Phi \left(\frac {\mathbf{x}_{i} \hat{\boldsymbol{\beta}}}{\sigma} \right) \tag{10.19}

    • For the PEA we use \hat\beta_{j} \, \Phi(\bar{\mathbf x} \hat{\boldsymbol \beta} / \hat\sigma)

Proof for the partial effect:
  • Note, that d \Phi(z) / d z = \phi(z) and d \phi(z) / d z = -z \phi(z)

\dfrac {\partial }{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right ] \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} - \sigma\left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} - \left (\frac{\mathbf x_i \boldsymbol \beta} {1} \right) \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)
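A minimal sketch of Equation 10.19 in R (assuming a Tobit model `outtobit` fitted with tobit() from the AER package, left-censored at 0, as in the example in Section 10.4.4 below):

Code
library(AER); library(wooldridge); data("mroz")

outtobit <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                  data = mroz)   # left = 0 is the default

fac <- mean( pnorm(outtobit$linear.predictors / outtobit$scale) )  # average adjustment factor
fac * coef(outtobit)                                               # APEs, Equation 10.19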


Sometimes, we only want the effect on the choices of already employed people about their hours worked in response to changing wages, or the choice about expenditures on tobacco for people who actually are smokers (the second partial effect mentioned above). In this case we need

E(y_i|\mathbf x_i, y_i>0) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )}

which we already encountered in Equation 10.8 when dealing with the truncated normal distribution. Remember, the function \lambda(z) is the inverse Mills ratio (Equation 10.7) and equals \phi(z)/\Phi(z)

  • Then, the partial effect is

\dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ = \beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right] \tag{10.20}

  • Once again, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ], which can be shown to be strictly between zero and one. For calculating the APE, we use the average over all observations of this adjustment factor. For the PEA we simply calculate the expression at the mean \bar{\mathbf x}

Proof for the partial effect:

Once again note, that d \Phi(z) / dz = \phi(z), \; d \phi(z) / d z = -z \phi(z) \, and \, \lambda(z)=\phi(z)/\Phi(z)

\text{ } \\ \dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ =

\beta_j + \dfrac{- \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \, \Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) - \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma}\phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } =

\beta_j + \left[ - \dfrac{ \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \frac {\beta_j}{\sigma}\Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) } {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } - \dfrac{ \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } \right] \ =

\beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right]
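Analogously, a sketch of the adjustment factor in Equation 10.20 (the effect conditional on y>0), reusing the `outtobit` object from the sketch above:

Code
z    <- outtobit$linear.predictors / outtobit$scale
imr  <- dnorm(z) / pnorm(z)              # inverse Mills ratio lambda(z)
fac0 <- mean(1 - z * imr - imr^2)        # average adjustment factor, strictly between 0 and 1
fac0 * coef(outtobit)                    # APEs conditional on y > 0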


10.4.4 Example

Now, we are estimating a concrete example using the MROZ data, which we have already studied in the context of Logit and Probit models. We are interested in the number of working hours supplied by married women, hours, and the effects of several explanatory variables

  • These variables are nwifeinc = non-wife income, educ = education in years, exper = experience in years, age, kidslt6 = number of kids younger than 6 and kidsge6 = number of kids aged 6 or older

  • We estimate the model with OLS, Tobit and Probit. The reason for the Probit estimation is a special feature of the Tobit model:

    • The Tobit estimates implicitly incorporate the choice of doing something (for instance, the probability of working, depending on some variables) as well as the response of the intensity of doing something (e.g., the response of working time to education)

    • These two effects are: {\partial P(y_i>0|\mathbf x_i)}/{\partial x_j} = (\beta_j/\sigma) \phi(\mathbf x_i \boldsymbol \beta/\sigma) (Equation 10.12, Equation 10.17) and {\partial E(y_i|\mathbf x_i, y_i>0)}/{\partial x_j} = \beta_j [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ] (Equation 10.20) and hence, are represented by the same parameter \beta_j times a positive constant

    • Therefore, every variable in the model affects the participation probability and the intensity equation with the same sign. Further, the Probit parameters should be approximately equal to \hat \beta_j/\hat \sigma in terms of the Tobit estimates. So, it is always a good idea to additionally estimate a corresponding Probit model (coding y_i > 0 as 1) and check whether the estimated coefficients actually have the same signs. If there are some coefficients with statistically significantly different signs, then the Tobit model might not be appropriate


library(boot)
library(AER)          # for tobit()
library(wooldridge) 
data("mroz")

# OLS
outols <- glm(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, x=TRUE) 

# Tobit model - default of `tobit` from `AER` package: left = 0. 
# You could also use `censReg` -> same result
outtobit <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)


# with censReg - same result but logSigma is reported (but avg_slopes does not work)
#outtobit1 <-  censReg(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)


# you can calculate the PEA with
# margEff(outtobit1)
Code
########## for APEs for modelsummary() 

## bootstrap only for std.errors necessary 

bootbdy <- function(data, indices) {
   dat <- data[indices,]  # resampled data
   out <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                         data = dat) 
   m_tobit <- avg_slopes(out)
   fac <- mean( pnorm( predict(out)/out$scale ), na.rm = TRUE ) # scaling factor, Equation 10.19 
   m_tobit["estimate"] = m_tobit["estimate"]*fac
   return( coef(m_tobit) )
}

bootresult_ <- boot(data=mroz, statistic=bootbdy, R=500)

se_boot <- matrixStats::colSds(bootresult_$t) # for debugging, should be the same as mm

mm = summary(bootresult_)  # to calculate std.errors

## to use `modelsummary`, making a data.frame (with naming according broom package)
m_tobit1 = data.frame(term=names(bootresult_$t0), estimate=mm$original, std.error=mm$bootSE, statistic=mm$original/mm$bootSE, p.value= (1 - pnorm( abs(mm$original/mm$bootSE) ))*2  )

## modelsummary needs a special format of a list with tidy and glance
m_tobit <- list(  tidy = m_tobit1,  glance = data.frame(NA)  )
class(m_tobit) <- "modelsummary_list"

############## 



# Probit model - to check for misspecification
outprobit <- glm( inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                  data = mroz, family = binomial(link = "probit"))


# Average partial effects for Probit with 'avg_slopes()'
marg_probit <- avg_slopes(outprobit)

# Average partial effects for Tobit with 'avg_slopes()', but without correction factor fac
marg_tobit <- avg_slopes(outtobit)

Code
modelsummary( list("OLS"=outols, "Tobit"=outtobit, "Tobit APE"=m_tobit, "Probit"=outprobit, "Probit APE"=marg_probit), 
              output="gt", 
              #statistic = c('std.error', 'statistic', 'p.value'), 
              #coef_omit = "Inter",
              gof_omit = "F|N", 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3
              ) 
OLS Tobit Tobit APE Probit Probit APE
(Intercept)  1330.482***   965.305*      0.270   
 (270.785)    (446.436)     (0.508)  
nwifeinc    -3.447       -8.814*      -5.189*     -0.012*     -0.004*  
   (2.544)      (4.459)      (2.593)     (0.005)     (0.001)  
educ    28.761*      80.646***    47.473***    0.131***    0.039***
  (12.955)     (21.583)     (12.649)     (0.025)     (0.007)  
exper    65.673***   131.564***    54.115***    0.123***    0.026***
   (9.963)     (17.279)      (4.945)     (0.019)     (0.002)  
I(exper^2)    -0.700*      -1.864***   -0.002** 
   (0.325)      (0.538)     (0.001)  
age   -30.512***   -54.405***   -32.026***   -0.053***   -0.016***
   (4.364)      (7.419)      (4.259)     (0.008)     (0.002)  
kidslt6  -442.090***  -894.022***  -526.278***   -0.868***   -0.261***
  (58.847)    (111.878)     (73.498)     (0.118)     (0.032)  
kidsge6   -32.779      -16.218       -9.547       0.036       0.011   
  (23.176)     (38.641)     (23.421)     (0.044)     (0.013)  
R2     0.266   
AIC 12117.1       7656.2       818.6       818.6     
BIC 12158.7       7697.8       855.6       855.6     
Log.Lik. -6049.534    -401.302    -401.302   
RMSE   746.18       953.68        0.42        0.42    
Left_censored   325      
Un_censored   428      
Right_censored     0      
Total   753      
Scale  1122.02*** 
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Figure 10.7: Number of working hours supplied by married women, hours, and the effects of several explanatory variables, estimated with OLS, Tobit, Tobit-APEs (Equation 10.19) and a Probit model for checking signs of parameters

Code
#### Code for `texreg` procedure - not reported 
library(texreg)

# Calculating APEs for tobit model 
# Basically with: 
#   fav <- mean( pnorm( predict(outtobit)/outtobit$scale ), na.rm = TRUE  )
#   APE <- fav * coef(outtobit)

source('~/Dropbox/R/User Functions/extract.margins.R')   # user function for texreg
source('~/Dropbox/R/User Functions/extract.tobitMargins.R')   # user function for texreg

# APE for Tobit with bootstrapping for standard errors. User fn tobitAPE is integrated in extract.tobitMargins.R 
bootbody <- function(data, indices) {
   out <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                 data = mroz[indices,])
   return( tobitAPE(out) )
 }
bootresult <- boot(data=mroz, statistic=bootbody, R=1000)


# preparing for texreg to know tobit_output (append classes: "tobitMarg","APE")
tobit_out <- outtobit
class(tobit_out) <- c("tobitMarg","APE", "tobit", "survreg", "margins")


screenreg( list(outols, outtobit, tobit_out, outprobit, marg_probit ),
           custom.model.names = c("OLS", "Tobit", "Tobit APE", "Probit", "Probit APE"), 
            digits = 3, 
           bootresults=bootresult, discrete=c(9)
           )

  • Because of the different scaling factors involved, Tobit coefficients are not directly comparable to OLS coefficients. For instance, look at the OLS estimate for \beta_{educ} which is 28.8 and the Tobit estimate of 80.6. But this is the estimated effect on the latent variable, which is not really relevant in this context

  • To compare Tobit and OLS, one has to compare average partial effects, APEs, Equation 10.18. It turns out that the partial effects of Tobit and OLS differ in a number of cases, though not in all. For instance, the effect of education on E(y|\mathbf x), the expectation of observed hours worked, is estimated to be 47.5

  • Another difference between Tobit and OLS is that, due to the linearity of the model, OLS assumes constant partial effects, whereas partial effects are nonconstant in Tobit. To demonstrate this, look at Figure 10.8 below. There, the different predictors for hours worked are displayed, along with the observed points. The most relevant one is the blue line, which represents E(y|\mathbf x). Note how the blue line converges to the red one, the linear predictor \mathbf x \hat {\boldsymbol \beta} of the Tobit model, as the censoring becomes less important. Further note, the OLS predictor approximates E(y|\mathbf x) quite well over some range of \mathbf x

  • Let us compare the estimates of the Tobit model with the Probit ones. It turns out that every coefficient has the same sign. Further, dividing the Tobit estimate for \beta_{educ} by the Tobit estimate for \sigma (which is 1122) we get 0.072, compared to 0.13 for Probit. Doing the same for \beta_{kidslt6} we get -0.8 compared to -0.87. This is not a huge difference but shows that the initial decision to work is more strongly affected by education or small kids than the decision about the hours worked once in the labor force (see the check below)
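This sign and scale check can be computed directly from the fitted objects above (a small sketch):

Code
# Tobit coefficients scaled by the estimated sigma versus the Probit coefficients
round( cbind( Tobit_scaled = coef(outtobit) / outtobit$scale,
              Probit       = coef(outprobit) ), 3)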


Code
 tobitPredict <- function(x, model) {   # E(y | x) for the tobit model, Equation 10.16 with c = 0
   pred <- pnorm( x/model$scale ) * x + model$scale * dnorm( x/model$scale )
  return(pred) }

 tobitPredict0 <- function(x, model) {   # E(y | x, y > 0), Equation 10.8 with c = 0 (uses the IMR)
   pred0 <- x + model$scale * dnorm( x/model$scale ) /  pnorm( x/model$scale )
  return(pred0) }

olsout2 <- lm(mroz$hours ~ outtobit$linear.predictors)

plot( outtobit$linear.predictors, mroz$hours, ylim=c(-1000, 5000), pch = 16, cex=0.9,
      col=c("#666666", "#666666", "#666666"), xlab="Linear tobit predictor X*beta",
      ylab="Hours and predictors of hours", 
      main = "Different predictors for hours worked" )
abline(c(0,0),c(1,1), col="red", lwd=2.5)
curve(tobitPredict(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE, 
      type = "l", col="blue", lwd = 4.5)
abline(olsout2, col="purple", lwd =2)
abline(0,0) 
curve(tobitPredict0(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE, 
      type = "l", col="cornflowerblue", lwd = 3)
text(-2500, 640, "E(y|y>0)")
text(-600, -950, "E(y*)")
text(-1100, 280, "E(y|y≥0) = E(y)")
text(-1950, -690, "OLS")

Figure 10.8: E(y|y≥0) (blue), linear tobit predictor X*beta (red), OLS predictor (purple) and E(y|y>0) (light blue)

10.5 Incidental truncation

Until now, we dealt with the problem of truncated and censored data, which is directly linked to the dependent variable y and leads to a sample that is not random any more; this can render OLS estimates inconsistent.

  • Remember, sample selection poses no problem if it is solely linked to exogenous variables or is exogenous to the errors 3

    • To formally show this result, define the indicator s_i, which is one if the observation i is in the sample and otherwise is zero. Then we have for s_i=1 (for fully observed observations, including \mathbf x_i)

s_i y_i \, = \, s_i \mathbf x_i \boldsymbol \beta + s_i u_i \ \ \Rightarrow

E(s_i y_i|\mathbf x_i, s_i=1) \, = \, E(s_i \mathbf x_i \boldsymbol \beta| \mathbf x_i, s_i=1) + E(s_i u_i|\mathbf x_i, s_i=1) \ \ \Rightarrow

E(y_i|\mathbf x_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \underbrace {E(u_i|\mathbf x_i, s_i=1)}_{\text {should be } 0} \, = \, \mathbf x_i \boldsymbol \beta

  • So, if the expectation of u_i is not influenced by s, the conditional expectation of y is equal to \mathbf x_i \boldsymbol \beta and thus, \boldsymbol \beta can be consistently estimated by OLS with the available sample

    • The expectation of u_i is not influenced by s if

      1. s is only a function of the exogenous \mathbf x (exogenous sample selection), in which case the condition E(u_i|\mathbf x_i, s_i=1)=0 reduces to the usual condition E(u_i|\mathbf x_i)=0

      2. s is purely random (random sample selection), in which case the condition E(u_i|\mathbf x_i, s)=0 is met

      3. s is a mixture of both cases above
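The three cases can be illustrated with a small simulation sketch (assumed DGP, y = 1 + 0.5x + u):

Code
set.seed(123)
n <- 10000
x <- rnorm(n)
y <- 1 + 0.5*x + rnorm(n)

s_exog <- x > 0                    # selection based on an exogenous regressor (case 1)
s_rand <- runif(n) < 0.5           # purely random selection (case 2)
s_endo <- y > 1                    # selection based on the outcome itself

coef(lm(y ~ x, subset = s_exog))   # close to (1, 0.5)
coef(lm(y ~ x, subset = s_rand))   # close to (1, 0.5)
coef(lm(y ~ x, subset = s_endo))   # biased (slope attenuated towards zero)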


Incidental truncation or sample selection is a mechanism where sample selection is only indirectly linked to the dependent variable y or its error u; that means via other variables.

  • Very often but not always, this is due to self-selection

  • The leading example is the following: we want to estimate the market wage a person could command in the labor market. This market wage would depend on her education, experience, age, etc. But if the wage offered by the market is too low, meaning that the market wage is lower than her reservation wage, she will not work at all and her potential market wage is not observed

    • This is clearly a self-selection mechanism, and a first stage decision determines whether the wage for this particular person is observed or not. The sample is not random any more and, in particular, is linked to the unobserved reservation wage

    • People with low potential market wage or high reservation wage might therefore be heavily underrepresented in the sample

      • And if the factors deciding labor force participation also influence the potential market wage, we will get biased results regarding the factors determining the market wage
  • A solution to this problem is to define a separate model describing the decision for working versus non-working to get a clue whether a particular person will probably work and try to make an adjustment for the underrepresented cases

  • Hence, we will have two models, a selection or participation model and the outcome model (or sometimes called intensity model) which we are primarily interested in


10.5.1 Heckit model

The selection model (participation model) can be modeled as a simple Probit model; we want to estimate the conditions under which a person is expected to work (or, more generally, to participate). This is the case when the probability of working is larger than that of non-working. Naturally, this leads to a probit model:

s_i^*=\mathbf z_i \boldsymbol \gamma + e_i, \ \ \ \ s_i = \begin{cases} 1 \ \text{ if } \ s_i^*>0 \\ 0 \ \text{ if } \ s_i^* ≤ 0 \end{cases} \tag{10.21}

  • Here, s_i^* is a latent variable, which in our example could be the difference between potential market wage and reservation wage

  • We predict labor market participation, s_i=1, if s_i^*>0, i.e., the market wage is larger than the reservation wage. Clearly, we predict out of labor force, s_i=0, if s_i^*≤0. In particular, s_i=1 if

e_i > - \mathbf z_i \boldsymbol \gamma \tag{10.22}

  • The vector \mathbf z_i contains the explanatory variables for participation in the labor market (in our example the number of kids, income of husband, age, etc.), \boldsymbol \gamma is the corresponding parameter vector to be estimated and e_i represents unobserved heterogeneity of the utility functions and especially, information about the unobserved reservation wage

The equation of primary interest is the outcome equation (sometimes called intensity equation), in our example the wage equation:

y_i^* = \mathbf x_i \boldsymbol \beta + u_i, \ \ \ y_i = \begin{cases} y_i^* \ \text{ if } \ s_i=1 \\ \text{unobserved if } \ s_i = 0 \end{cases} \tag{10.23}

  • Here, y_i^* is the latent dependent variable, in our example the potential market wage, \mathbf x_i are the explanatory variables and u_i represents unobserved heterogeneity of the individuals, which also affects the market wage

  • The observation rule says that y_i is only observed if s_i=1, i.e. s_i^*>0, which means participation in the labor market in our example. The value of s_i is determined in the first stage decision problem, the selection equation

  • Note, we assume that \mathbf z_i is observed for all i in the sample, as otherwise we couldn’t estimate the selection problem

  • According to the selection model (Equation 10.21), besides \mathbf z_i, the unobserved error e_i determines whether the market wage for person i is observed or not. Therefore, we have an incidental sample selection because these factors are not directly linked to y_i or u_i


We assume that the errors from the selection and outcome equations, e and u, are jointly normally distributed with correlation \rho. This means that the joint distribution is solely described by the mean vector and the covariance matrix and thus rules out any nonlinear relationships between e and u:

\binom{e}{u} \ \sim \ N\left ( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \, \sigma_u \\ \rho \, \sigma_u & \sigma_u^2 \end{bmatrix} \right ) \tag{10.24}

  • Note, \sigma_e^2 = 1 is no restriction of generality as this variance is not identified in a Probit model and thus can be arbitrarily normalized (we already discussed this point in the chapter on binary choice models)

  • We assume, that e_i as well as u_i are independent of (\mathbf x_i, \mathbf z_i), thus E(u_i | \mathbf x_i, \mathbf z_i)=0 and E(e_i | \mathbf x_i, \mathbf z_i)=0, MLR.4 (see Section 2.7)

  • Whether or not OLS applied to the wage equation leads to consistent and asymptotically normally distributed estimates of \boldsymbol \beta depends on whether the selection mechanism via e_i is exogenous (uncorrelated) with respect to the error term u_i (and thus with y_i)

  • Therefore, the correlation coefficient \rho is crucial for this problem. If \rho \neq 0, a sample selection bias arises


To show the importance of \rho we decompose u_i into its conditional expectation on e_i and an error term \epsilon_i, which is uncorrelated with e_i by construction and uncorrelated with \mathbf x_i, \mathbf z_i by assumption (as u_i is independent of (\mathbf x_i, \mathbf z_i))

u_i \, = \, E(u_i|e_i) + \epsilon_i, \quad \text{with } \, E(\epsilon_i|\mathbf x_i, \mathbf z_i, e_i)=0

  • Because (u_i,e_i) are jointly normal, E(u_i|e_i) is equal to the linear projection (regression) of u_i on e_i, with the projection coefficient \sigma_{u,e}/\sigma_e^2 (with \sigma_{u,e} = covariance of u and e)

    • This reduces to \sigma_{u,e}, as \sigma_e^2 is normalized to one. Considering the definition of the correlation coefficient, \rho := \sigma_{u,e}/(\sigma_u \sigma_e), we get

u_i \, = \, \sigma_{u,e}e_i + \epsilon_i \, = \, \rho \, \sigma_u e_i + \epsilon_i

y_i^* = \mathbf x_i \boldsymbol \beta + \rho \, \sigma_u e_i + \epsilon_i \tag{10.25}

  • Taking the conditional expectation of Equation 10.25 for the observed sample (s_i=1 \ \Rightarrow \ e_i>-\mathbf z_i \boldsymbol \gamma \,) we get (note, e_i \perp (\mathbf x_i, \mathbf z_i) by assumption)

E(y_i^*|\mathbf x_i, \mathbf z_i, s_i=1) \ = \ \mathbf x_i \boldsymbol \beta \, + \, \rho \sigma_u E(e_i|s_i=1) \, + \, \underbrace {E(\epsilon_i|\mathbf x_i, \mathbf z_i, \overbrace {s_i=1}^{e_i>-\mathbf z_i \boldsymbol \gamma})}_{= \,0, \text{ as } \epsilon_i \, \perp \, (\mathbf x_i, \mathbf z_i,e_i) }


According to the selection model (Equation 10.21), the second term on the right-hand side is:

\rho \sigma_u E(e_i|s_i=1) \, = \, \rho \sigma_u E(e_i|s_i^*>0) \, = \, \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma)

  • But this is the expectation of a truncated standard normal distribution and hence, we can apply the theorem about truncated normal distributions, Theorem 10.1 (with y=e, \mu=0, c=-\mathbf z_i \boldsymbol \gamma and \sigma_{(e)}=1)

    \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma) \, = \, \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma)

    with \lambda(\mathbf z_i \boldsymbol \gamma) being the inverse Mills Ratio (see definition in Equation 10.7)

Putting things together, we finally arrive at the conditional expectation of the observed y_i, which is the population regression in this case:

E(y_i|\mathbf x_i, \mathbf z_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma) \tag{10.26}

  • Note the similarity to Equation 10.8, which is not accidental, as this too is a case of a truncated regression problem

Regarding our leading example: if the correlation of unobserved wage determinants and unobserved determinants of the work decision is positive (\rho>0), the women observed working will have higher wages than expected solely from their characteristics (= positive selection). This effect is captured by \lambda(\mathbf z_i \boldsymbol \gamma), which is strictly positive

  • Furthermore, the correction factor \lambda(\mathbf z_i \boldsymbol \gamma) will be higher for observations with a low probability of being in the sample (because of a low \mathbf z_i \boldsymbol \gamma and \lambda'< 0), but which are nonetheless in the sample because of a high e_i, which represents unobserved heterogeneity and is possibly correlated with the error u_i (the shape of \lambda is illustrated in the short numerical check below)
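The claimed shape of the correction factor is easy to verify numerically. The snippet below is a simple stand-alone illustration (not part of the estimation example) that evaluates \lambda(x)=\phi(x)/\Phi(x) over a grid:

Code
# Inverse Mills ratio lambda(x) = phi(x)/Phi(x): strictly positive and decreasing
lambda <- function(x) dnorm(x) / pnorm(x)
x <- seq(-3, 3, by = 1)
round(cbind(x, lambda = lambda(x)), 3)   # lambda > 0 and falls monotonically in x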

As Equation 10.26 shows, if \rho \neq 0, regressing y_i solely on \mathbf x_i leads to a sample selection bias in the form of an omitted variable bias; the omitted variable \lambda(\mathbf z_i \boldsymbol \gamma) is a nonlinear function of \mathbf z_i and therefore probably correlated with \mathbf x_i (as \mathbf x_i is usually a part of \mathbf z_i)

  • In this case, there are unobserved factors affecting both the selection and the outcome equation. This is the cause of this type of sample selection bias

Finally, let us discuss another aspect more closely by means of our example. We argued that s_i^* in the selection equation could be interpreted as the difference between market wage offers and the reservation wage, (w_i - w_i^r). But according to the outcome Equation 10.23, the market wage offers depend on \mathbf x_i and on u_i. This implies two things

  1. The variables of the outcome equation, \mathbf x_i, obviously play some role in the selection equation (via the market wage offers). Therefore, \mathbf x_i should be part of the variables in \mathbf z_i, the variables of the selection equation. This inclusion of all \mathbf x in \mathbf z is generally recommended as standard procedure

  2. A very similar argument can be put forward for the unobserved factors of the outcome equation, u_i. These too influence the market wage offers and therefore the decision to work or not. Thus, these factors are also a part of e_i, the unobserved factors of the selection equation. Hence, the error terms u_i and e_i are expected to be positively correlated in our example, which implies \rho > 0


Estimation

The model introduced above is called the Heckman model or Heckit, Heckman (1979). It is a generalization of the Tobit model. In fact, if we set \mathbf x_i = \mathbf z_i and u_i = e_i and censor y_i at zero for the cases in which y_i is not observed, we arrive at the Tobit model. For this reason, the Heckman model is also sometimes called the Tobit II model.

  • There are two procedures to estimate the model, a 2-step procedure and maximum likelihood estimation. Surprisingly, the 2-step procedure is much more common in practice. This procedure is as follows

    1. Estimate the selection model (Equation 10.21) by probit. This determines if the dependent variable is expected to be observed. The probability for y_i to be observed is

    P(s_i=1|\mathbf z_i) = \Phi(\mathbf z_i \boldsymbol \gamma) \ \rightarrow \ \hat {\boldsymbol \gamma}

    2. Calculate the inverse Mills ratio using the estimated probit coefficients for every observation

    \hat\lambda({\mathbf z}_i \hat{\boldsymbol \gamma}) = \dfrac {\phi({\mathbf z}_i \hat{\boldsymbol \gamma})}{\Phi({\mathbf z}_i \hat{\boldsymbol \gamma})}

    3. Estimate the outcome Equation 10.23, augmented by the estimated inverse Mills ratio as a correction factor, by OLS. This yields consistent estimates of \boldsymbol \beta. The coefficient \alpha is an estimate of \rho \sigma_u from Equation 10.26

    y_i \, = \, \mathbf x_i \boldsymbol \beta \, + \, \alpha \, \hat\lambda(\mathbf z_i \hat{\boldsymbol \gamma}) \, + \, error \tag{10.27}

    4. A complication arises with the calculation of correct standard errors for regression Equation 10.27. Besides an inherent heteroscedasticity of the error, one has to take into account that \hat \lambda is itself estimated with the help of the estimated parameters \hat {\boldsymbol \gamma} of the selection equation, which introduces additional variation and correlations across observations. For details see Green (2017), chapt. 19, p. 994. A minimal manual sketch of these steps follows below
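The sketch below carries out the estimation steps by hand with the mroz data used in the example further down (the resulting OLS standard errors are not corrected for the estimated \hat \lambda, so it serves only to make the mechanics explicit):

Code
library(wooldridge)
data("mroz")

# Step 1: Probit selection equation for labor force participation
probit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
              family = binomial(link = "probit"), data = mroz)

# Step 2: inverse Mills ratio, evaluated at the estimated index z_i * gamma-hat
zg       <- predict(probit, type = "link")
mroz$imr <- dnorm(zg) / pnorm(zg)

# Step 3: outcome equation augmented by the inverse Mills ratio, OLS on the selected sample
out2step <- lm(log(wage) ~ educ + exper + I(exper^2) + imr, data = mroz, subset = inlf == 1)
summary(out2step)   # coefficient of imr estimates rho * sigma_u; standard errors not corrected

The point estimates of the outcome equation should coincide with the 2-step heckit output reported below; only the reported standard errors differ, as heckit applies the correction mentioned in step 4.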

The whole procedure sounds a little bit complicated, but there are standard routines for both STATA and R. The corresponding R package is sampleSelection with the procedure heckit. These routines also calculate the correct standard errors for the 2-step procedure.

  • The estimated coefficient \hat \alpha of \hat \lambda in Equation 10.27 can be used to test for a sample selection bias. If this parameter is not statistically significantly different from zero, there seems to be no sample selection bias, as \hat \alpha is an estimate of \rho \sigma_u; hence, \hat \alpha = 0 \Rightarrow \rho = 0

  • However, this whole 2-step procedure and testing for a sample selection bias should be carried out with great care. The reason is that there is an inherent identification problem; the variable \lambda is a nonlinear function (but over a wide range nearly a linear function) of the variables in \mathbf z (see the short numerical check after this list). As we learned above, \mathbf x is generally contained in \mathbf z. Thus, we often have a severe multicollinearity problem in the estimated outcome Equation 10.27, leading to very imprecise parameter estimates and tests with little power

  • To reduce the identification problem, we should always have variables in \mathbf z which are not contained in \mathbf x; these help to identify the coefficients of \mathbf x in Equation 10.27, because \hat \lambda is then no longer a nearly linear combination of the variables in \mathbf x

  • In practice, it is often hard to find variables that play a role in the selection model but credibly do not play a role in the outcome model. Erroneously omitting variables from the outcome model can lead to severe biases
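The near-linearity of \lambda mentioned above can be checked directly. The snippet below is only an illustration over an arbitrarily chosen range of the probit index:

Code
# Over a moderate range of the index z * gamma, lambda() is close to a straight line,
# which is the source of the multicollinearity problem without exclusion restrictions
idx <- seq(-1.5, 1.5, by = 0.01)   # arbitrarily chosen range of the index
lam <- dnorm(idx) / pnorm(idx)
summary(lm(lam ~ idx))$r.squared   # close to 1: lambda is nearly linear over this range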


Example

We argued that \mathbf z should contain at least one variable which is not contained in \mathbf x. Let us discuss this aspect by means of our example, which is estimated below.

  • We are interested in the market wage offers of married women. We assume that education and experience play an important role in this respect. So, this is our outcome equation

  • To avoid a sample selection problem (because we only observe wages of employed women), we apply Heckman’s 2-step procedure. Thus, we additionally specify a selection equation for the decision to work, to estimate the correction factor \lambda. We argued above that all variables of the outcome equation should be included in the selection equation, because potential wage offers are clearly an important factor for the decision to work as well

  • Therefore, the selection model contains at least education and experience as explanatory variables. However, to better identify the parameters of the outcome equation, we additionally need variables that are important for the decision to work and do not play a role for the market wage offers

    • In this case, we find such variables; we added the non-wife income (income of the husband), the number of children under six, the number of children six or older, and age as explanatory variables to the selection model. These variables should have an effect on the reservation wage and therefore on the decision to work, but not on the wage offers in the market

library(sampleSelection)
library(wooldridge)
data("mroz")

# OLS estimation
ols <-  lm(log(wage) ~ educ + exper + I(exper^2), data = mroz)

# Heckit 2 Step model (default); the first line is the selection equation, 
# the second line the outcome equation
heck   <-  heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                log(wage) ~ educ + exper + I(exper^2),
                data = mroz) 
                
summary (heck)
      --------------------------------------------
      Tobit 2 model (sample selection model)
      2-step Heckman / heckit estimation
      753 observations (325 censored and 428 observed)
      15 free parameters (df = 739)
      Probit selection equation:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  0.270077   0.508593   0.531  0.59556    
      nwifeinc    -0.012024   0.004840  -2.484  0.01320 *  
      educ         0.130905   0.025254   5.183 2.81e-07 ***
      exper        0.123348   0.018716   6.590 8.34e-11 ***
      I(exper^2)  -0.001887   0.000600  -3.145  0.00173 ** 
      age         -0.052853   0.008477  -6.235 7.61e-10 ***
      kidslt6     -0.868328   0.118522  -7.326 6.21e-13 ***
      kidsge6      0.036005   0.043477   0.828  0.40786    
      Outcome equation:
                    Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.5781032  0.3050062  -1.895  0.05843 .  
      educ         0.1090655  0.0155230   7.026 4.83e-12 ***
      exper        0.0438873  0.0162611   2.699  0.00712 ** 
      I(exper^2)  -0.0008591  0.0004389  -1.957  0.05068 .  
      Multiple R-Squared:0.1569,    Adjusted R-Squared:0.149
         Error terms:
                    Estimate Std. Error t value Pr(>|t|)
      invMillsRatio  0.03226    0.13362   0.241    0.809
      sigma          0.66363         NA      NA       NA
      rho            0.04861         NA      NA       NA
      --------------------------------------------
# Heckit MLE model
heckml <-  heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                log(wage) ~ educ + exper + I(exper^2),
                data = mroz, method = "ML") 

Code
library(texreg)
screenreg( list(ols, heck, heckml),  digits = 4, custom.model.names = c("OLS", "Heckit 2Stp", "Heckit MLE"), 
           prefix = FALSE, include.selection = FALSE, include.rsquared=FALSE ) 
      
      =========================================================
                      OLS           Heckit 2Stp   Heckit MLE   
      ---------------------------------------------------------
      (Intercept)      -0.5220 **    -0.5781        -0.5527 *  
                       (0.1986)      (0.3050)       (0.2604)   
      educ              0.1075 ***    0.1091 ***     0.1084 ***
                       (0.0141)      (0.0155)       (0.0149)   
      exper             0.0416 **     0.0439 **      0.0428 ** 
                       (0.0132)      (0.0163)       (0.0149)   
      exper^2          -0.0008 *     -0.0009        -0.0008 *  
                       (0.0004)      (0.0004)       (0.0004)   
      invMillsRatio                   0.0323                   
                                     (0.1336)                  
      sigma                           0.6636         0.6634 ***
                                                    (0.0227)   
      rho                             0.0486         0.0266    
                                                    (0.1471)   
      ---------------------------------------------------------
      Adj. R^2          0.1509        0.1490                   
      Num. obs.       428           753            753         
      Censored                      325            325         
      Observed                      428            428         
      AIC                                         1693.7702    
      BIC                                         1758.5071    
      Log Likelihood                              -832.8851    
      =========================================================
      *** p < 0.001; ** p < 0.01; * p < 0.05

  • In our example, the coefficient of \lambda, the inverse Mills ratio, is not significant. Thus, there is no evidence of a sample selection bias, which is why the estimates of OLS and Heckit are very similar

  • In the table, the results of a maximum likelihood estimation are also reported. As mentioned above, the 2-step procedure is much more common in empirical work, although the MLE is more efficient, especially with regard to an estimate of \rho. The downside of MLE is that it is said to be more prone to specification errors, for instance with regard to the normality assumption for the error terms

  • The likelihood function of the Heckman model consists of two parts: P(s_i=0), which is the probit part, and the density of y_i given s_i=1, times P(s_i=1)

    \ln L = \sum_{s_i=0} \ln (1 - \Phi(\mathbf z_i \boldsymbol \gamma)) + \sum_{s_i=1} \ln f(y_i|s_i=1) \Phi(\mathbf z_i \boldsymbol \gamma)

    As we have an incidental truncation, f(y_i|s_i=1) = f(y_i|e_i>-\mathbf z_i \boldsymbol \gamma) is a more complicated term based on a bivariate truncated normal distribution (see Green (2017), chapt. 19, p. 995); a sketch of the resulting log-likelihood in R is given after this list

  • Last but not least, it should be noted that many of the methods for limited dependent variables and sample selection discussed here can be extended to 2SLS and some also to panel data (for details see Green (2017))
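For completeness, the two-part structure of the log-likelihood can be coded directly. The function below is only a sketch under the standard parameterization with \sigma_e = 1; it is not the internal implementation of sampleSelection, and all names are illustrative.

Code
# Sketch of the Heckman log-likelihood (standard parameterization, illustrative only)
heckman_loglik <- function(gamma, beta, sigma, rho, y, X, Z, s) {
  zg  <- drop(Z %*% gamma)                          # selection index z_i * gamma
  ll0 <- pnorm(-zg[s == 0], log.p = TRUE)           # log(1 - Phi(z_i * gamma)) for s_i = 0
  u   <- y[s == 1] - drop(X[s == 1, , drop = FALSE] %*% beta)
  a   <- (zg[s == 1] + rho * u / sigma) / sqrt(1 - rho^2)
  ll1 <- pnorm(a, log.p = TRUE) +                   # selection part, given y_i
         dnorm(u / sigma, log = TRUE) - log(sigma)  # log density of y_i
  sum(ll0) + sum(ll1)
}

Maximizing this function over (\boldsymbol \gamma, \boldsymbol \beta, \sigma, \rho), e.g. with optim() and a reparameterization that keeps \sigma>0 and |\rho|<1, corresponds to the ML estimation carried out above with heckit(..., method = "ML").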


  1. For a truncation from the right (from above) we have: \lambda(x) =\dfrac {-\phi(x)}{\Phi(-x)}.↩︎

  2. Hint: Let \alpha = (\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma}). Then \lambda'(\alpha)=\frac {-\alpha \phi(\alpha) \Phi(\alpha) - \phi(\alpha)^2}{\Phi(\alpha)^2} = -\alpha \lambda(\alpha) - \lambda(\alpha)^2 = (-) and one can show that: -1<\lambda'(\alpha)<0.↩︎

  3. The following results carry over to IV-estimations as well, if E(u|\mathbf z, s)=0, with \mathbf z being the (external + internal) instruments.↩︎