Mainly based on Wooldridge (2019), Chapter 17 and Green (2017)

Limited dependent variables (LDV) are variables whose range is substantially restricted, for instance:
  • Binary variables, e.g., employed/not employed

  • Non-negative variables, e.g., wages, prices, interest rates

    • Non-negative variables with excess zeros, e.g., labor supply
  • Count variables, e.g., the number of arrests in a year

  • Censored variables, e.g., unemployment duration

We further have

Sample selection models
  • The sample used to infer population relationships is endogenously selected, e.g., in a wage offer regression, wages are only observable for people who are actually working

    • The resulting sample is no longer random, which can lead to biased estimates

10.1 Binary dependent variable

10.1.1 The linear probability model (LPM)

Until now, we only analysed models with dummy variables on the right hand side – as explanatory variables.

Now, we analyze models with a binary dependent variable, i.e., a dummy variable as left hand side variable.

  • For instance, y may represent labor force participation of married women, which equals one if the woman is in the labor force and zero if not. If we assume a linear model we have

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u

Under MLR.4, we have as usual (population regression line):

\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta

As the dependent variable only takes on the values 0 and 1 we additionally have in this case:

E(y \mid \mathbf x) := \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x)

\Rightarrow \ P(y=1 \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \tag{10.1}

This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.

  • The coefficient \beta_j describes the partial effect of the explanatory variable j on the probability that y=1

\beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}

  • This model was already studied in some detail in Section 5.8; see there, in particular, the estimated example

Code
# LPM for labor force participation of married women
outm <- lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
Code
# We calculate squared residuals and fitted values for plotting
ressq <-  outm$residuals^2
yhat <- outm$fitted.values

Measures of fit for binary dependent variables

  • A popular measure of fit for such models is the hit rate, which measures the portion of correctly predicted observations of the dependent variable (which has only zeros or ones)

    • Thereby, whenever \hat y_i (which is the estimated probability that y_i=1) is larger than the cutoff 0.5, the model predicts y_i=1 and when \hat y_i <0.5, the model predicts y_i=0
  • Besides the overall hit rate, one can separately examine the hit rates for observations with y_i=1 and with y_i=0

    • The portion of correctly predicted y_i's when y_i=1 is called sensitivity. This is the ability of the model to detect true “positives” (e.g., the percentage of people predicted to be in the workforce who actually are or, as an example from medicine, the percentage of sick people who are correctly identified by a tumor marker as having the condition)

    • The portion of correctly predicted y_i's when y_i=0 is called specificity. This is the ability of the model to detect true “negatives” (e.g., the percentage of people predicted not to be in the workforce who actually are not or, as an example from medicine, the percentage of healthy people who are correctly identified by a tumor marker as not having the condition)
  • Below, we calculate the hit rates described above for the LPM and, for purposes of comparison, for a corresponding Probit model (discussed below). The results are nearly identical (the code for this hit-rate function is listed below and can be used by any user)


Code
hitrate <- function(outm, cutoff=0.5, stragaz=TRUE) {
  
  # This function calculates overall hit rates, sensitivity and specificity for
  # LPM, Logit and Probit models
  # outm = output list of `lm()` or `glm()`
  # cutoff = threshold value for predicting 1, default = 0.5
  
  #requires package stargazer
  
  
  # load dependent variable of model
  y <- outm$model[[1]]

  # load fitted values
  yhat <- outm$fitted.values
  ly <- length(y)
  
  # prediction of one, if yhat > threshold
  ypred <- (yhat > cutoff)*1
  
  
  # y which are 1
  y_1 <- y[y==1]
  ly_1 <- length(y_1)

  # y which are zero
  y_0 <- y[y==0]
  ly_0 <- length(y_0)
  
  # overall hits (correct predictions)
  hit_o <- ifelse((ypred == 1 & y == 1) | (ypred == 0 & y == 0), 1, 0)
  hitrate_o <- sum(hit_o)/ly
  
  # correct predictions if y=1 (sensitivity)
  hit_1 <- ifelse((ypred == 1 & y == 1), 1, 0)
  hitrate_1 <- sum(hit_1)/ly_1
  
  # correct predictions if y=0 (specificity)
  hit_0 <- ifelse((ypred == 0 & y == 0), 1, 0)
  hitrate_0 <- sum(hit_0)/ly_0
  
  # storing results in data frame
  results <- data.frame(HitRate = hitrate_o, Sensitivity = hitrate_1, Specificity = hitrate_0, 
                        Nof1s = ly_1, Nof0s = ly_0, check.rows=TRUE)
  
  if (stragaz==TRUE) {
    stargazer::stargazer(results, type = "text", summary = FALSE, rownames = FALSE)
  }
  
  return(results)
}

# Hit rate of LPM
hr_lpm <- hitrate(outm)

===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734      0.818       0.625     428   325 
-------------------------------------------
# For comparison, estimate a corresponding Probit model 
# which we introduce below 

outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                 data = mroz, family = binomial(link = "probit"))

# Hit rate of Probit model
hr_probit <- hitrate(outprobit)

===========================================
HitRate Sensitivity Specificity Nof1s Nof0s
-------------------------------------------
0.734      0.813       0.631     428   325 
-------------------------------------------

10.1.2 Probit and Logit models

The main problems of the Linear Probability Model (due to the linearity in \mathbf x) are

  • Predictions sometimes lie outside the unit interval
  • Partial effects of explanatory variables are constant

An obvious solution for these problems is a nonlinear model for binary response

  • We assume that the expected response – the response probability – is a nonlinear continuous function G(\cdot) of the explanatory variables, with G'>0, G(z) \rightarrow 0 for z \rightarrow -\infty and G(z) \rightarrow 1 for z \rightarrow \infty

E(y|\mathbf x) \, =\, P(y=1 | \mathbf{x}) \, = \, G\left(\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}\right) \, = \, G(\mathbf{x} \boldsymbol{\beta}) \

  • G(\mathbf x) is the (inverse) link function which links the variables \mathbf x to the probability that y_i = 1

  • Obvious candidates for the function G(\cdot) are cumulative distribution functions (cdf), which are increasing and map the real line into the unit interval [0,1]

The estimation model for the observable variable y therefore is

y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u


  • The error term u of this model has the same properties as the one in the LPM, especially the errors are heteroscedastic with \operatorname{Var}(u_i | \mathbf x) = G(\mathbf x \boldsymbol \beta) (1 - G(\mathbf x \boldsymbol \beta)), (see Section 5.8)

  • Conditional on \mathbf x, the observations y_i are Bernoulli distributed

  • Regarding the choice of the link function there are two main models

    • Probit model with the (inverse) link function (with \phi(z) being the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution, see Equation A.3 and Equation A.4):

    G(z) \ = \ \Phi(z) := \, \int_{-\infty}^z \phi(v)dv \ = \, \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ \ \quad \text{ :cdf of standard normal distribution}

    • Logit model with the (inverse) link function:

    G(z) \ = \ \Lambda(z) := \, \dfrac {e^z}{1 + e^z} \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \ \ \ \quad \text{ :cdf of standard Logistic distribution}

  • Both distributions are symmetric and have quite similar shapes. However, the Logistic distribution has fatter tails – it is nearly identical to a t(8)-distribution. Outliers are therefore somewhat more probable, rendering the Logit estimates a little more robust against outliers
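
The similarity of the two cdfs can be visualized directly with base R; a minimal sketch (the factor \pi/\sqrt 3 rescales the standard Logistic to unit variance):

Code
# Compare the Probit and Logit (inverse) link functions
z <- seq(-4, 4, by = 0.05)
plot(z, pnorm(z), type = "l", col = "blue", ylab = "G(z)",
     main = "Standard normal vs. logistic cdf")
lines(z, plogis(z), col = "darkred", lty = 2)                  # standard Logistic, Var = pi^2/3
lines(z, plogis(z * pi / sqrt(3)), col = "darkgreen", lty = 3) # Logistic rescaled to unit variance
legend("topleft", c("Probit (pnorm)", "Logit (plogis)", "Logit, rescaled"),
       col = c("blue", "darkred", "darkgreen"), lty = 1:3, cex = 0.8)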


Latent variable interpretation

  • Both the Logit and Probit model have a latent variable formulation/interpretation

y^* = \mathbf x \boldsymbol \beta + e, \quad \text{ and } \ \ y_i = \begin{cases} 1 \ \text{ if } \ y_i^*>0 \\ 0 \ \text{ if } \ y_i^*\leq 0 \end{cases}

  • e \sim N(0,1) with the Probit model; and e \sim Logistic(0,\pi^2/3) with the Logit model

  • If the latent variable y^* is larger than zero, the observable y takes on the value 1; if it is less than or equal to zero, y takes on 0 (y^* can thus be interpreted as the propensity to have y = 1)

    • The latent variable is often interpreted as the net utility of doing something – Random Utility models. If this net utility is higher than some threshold (0 in our case), then the utility is higher than the opportunity costs (the utility of other alternatives) and the action is taken, i.e., y=1
  • Note that there is rarely a well-defined unit of measurement for y^*. Therefore, identification requires a scaling, which is done by setting the variance of e to one (Probit), respectively to \pi^2/3 (Logit)

  • We do not observe the utility of the action – y^* is a latent variable – we only observe whether the action is taken or not, y=1 or y=0, i.e., whether y^* > 0 or y^* \leq 0. In the following it is shown that this model of discrete choice leads to our estimation model


  • From the latent variable model it follows:

E(y \,| \, \mathbf x) \ = \ P(y = 1\,| \,\mathbf x) \ = \ P(y^* > 0 \,| \,\mathbf x) \ = \ P(\mathbf x \boldsymbol \beta + e > 0\,| \,\mathbf x) \ = \ P(e > -\mathbf x \boldsymbol \beta \,| \,\mathbf x) \ \ =

1 - G(-\mathbf x \boldsymbol \beta) \ \ = \text { (by symmetry) } \

G(\mathbf x \boldsymbol \beta) \ = \begin{cases} \Phi(\mathbf x \boldsymbol \beta) \ \text{ :Probit model } \\ \Lambda(\mathbf x \boldsymbol \beta) \ \text{ :Logit model } \end{cases}

  • Hence, the latent variable model serves as a theoretical foundation of our nonlinear estimation model, postulated at the beginning of this section. We have shown that:

E(y \,| \, \mathbf x) \, = \, G(\mathbf x \boldsymbol \beta)

  • This leads to the estimation model

y \, = \, G(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k) + u

  • Note, neither the threshold of 0 nor \operatorname{Var}(e) = 1 (respectively \pi^2/3) in the latent variable model is a restriction of generality. The former is absorbed by the intercept, the latter only amounts to a rescaling of the coefficients. So, neither the threshold nor \operatorname{Var}(e) is of particular interest; they are not identified and hence free to choose
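
To make the latent variable formulation concrete, here is a small simulation sketch (sample size and true parameter values are chosen purely for illustration): the latent y^* is generated with standard normal errors, only its sign is observed, and a Probit regression approximately recovers the true coefficients.

Code
# Simulation sketch of the latent variable model (illustrative values)
set.seed(123)
n <- 5000
x <- rnorm(n)
ystar <- -0.5 + 1.0 * x + rnorm(n)    # latent variable with e ~ N(0,1)
y <- as.numeric(ystar > 0)            # we only observe the binary outcome
coef(glm(y ~ x, family = binomial(link = "probit")))   # should be close to (-0.5, 1.0)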

10.1.3 The Gumbel model

  • Sometimes, a non-symmetric distribution seems more appropriate, for instance if the numbers of ones (successes) and zeros (failures) are very different in the sample. In this case the Gumbel distribution (an extreme value distribution) can be used as an interesting alternative link function: G(z) \ =\, {e^{-e^{-z}}} \ = {e^{-e^{-\mathbf x \boldsymbol \beta}}} \ \ \ \ \quad \text{ :cdf of standard Gumbel distribution}

  • This is a right skewed distribution, which might be appropriate if the ones are the rare (extreme) outcomes. If the zeros are the rare outcomes one could use a left skewed distribution, the (mirrored) Complementary Gumbel distribution: 1-G(-z) \ =\, 1-e^{-e^{z}} \ = 1-e^{-e^{\mathbf x \boldsymbol \beta}} \ \ \ \ \quad \text{ :cdf of Complementary Gumbel distribution}

  • Interestingly, only this version is implemented in the R function glm() (as link = "cloglog"). So, if you want to use this left skewed distribution and the ones are the rare outcomes in your sample, you have to recode the y's with the command “ifelse(dataset$varname_of_y == 1, 0, 1)”, turning ones into zeros and zeros into ones (see the sketch below)
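
A toy sketch of this recoding with simulated data in which the ones are rare (all names and values are made up for illustration); the cloglog link of glm() corresponds to the Complementary Gumbel cdf given above:

Code
# Simulated data with rare ones
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + x))            # roughly 5-10 percent ones
dat <- data.frame(y = y, x = x)

# Recode so that the rare ones become zeros (and vice versa) ...
dat$y_rev <- ifelse(dat$y == 1, 0, 1)

# ... and estimate with the Complementary Gumbel (cloglog) link implemented in glm()
outgumbel <- glm(y_rev ~ x, data = dat, family = binomial(link = "cloglog"))
summary(outgumbel)
# note: relative to a model for the original y, the signs of the coefficients flip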


10.1.4 LPM versus Probit

To show graphically the difference between the LPM and the Probit (or Logit) model, we estimate the probability of denial of a bank credit (deny) as a function of the ratio of monthly debt payments to monthly income (pirat)

library(visreg)  # for plotting partial effect
library(wooldridge)
library(AER)  # the HMDA data set (deny, pirat) is provided by the AER package
data("HMDA")

# convert `deny` to numeric. This is coded as a factor variable in the data set 
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimating linear probability model
denylpm <- lm(deny ~ pirat, data = HMDA)

To estimate the Probit model, we use the glm() function in R – glm stands for generalized linear models. There are two additional options to set in this function:

  • family for the distribution of y; this is binomial in our case (binary dependent variable)

  • and you have to specify the link function G(\mathbf x \boldsymbol \beta) for the probit model

# estimating probit model
denyprobit <- glm(deny ~ pirat, family = binomial(link = "probit"), data = HMDA)

Code
visreg(denylpm, band = FALSE, ylim=c(-0.05, 1.05), main = "LPM", xlab = "P/I ratio")
abline(0,0, lty = 2, col = "darkred");  abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied");  text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)

Figure 10.1: Estimated probability of denying a bank credit with a LPM

Code
visreg(denyprobit, scale = "response", band = FALSE, ylim=c(-0.05, 1.05), main = "Probit", xlab = "P/I ratio", rug=0)
abline(0,0, lty = 2, col = "darkred");  abline(1,0, lty = 2, col = "darkred")
text(2.5, 0.95, cex = 0.8, "Mortgage denied");  text(2.5, 0.05, cex= 0.8, "Mortgage approved")
points(deny~pirat, data = HMDA, pch = 20, cex.main = 0.85)

Figure 10.2: Estimated probability of denying a bank credit with a Probit model

10.1.5 Partial Effects

  • The two previous figures show that the predicted probabilities in the Probit model are always within the unit interval, contrary to the LPM

  • Furthermore, the partial effects of the explanatory variables are not constant any more (as in the LPM), but are nonlinear and depend on the values of \mathbf x

    • On the one hand, this is a big theoretical advantage, as constant (linear) partial effects are not plausible when predicting probabilities

    • On the other hand, this is a practical disadvantage, as the estimated coefficients \hat\beta are not easily interpretable any more. Basically, one has to rely on average partial effects, APE.

  • The partial effect of the explanatory variable x_j on the probability that y=1 is

\dfrac{\partial P(y=1| \mathbf x)}{\partial x_j} \ = \ \dfrac {\partial G(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \begin{cases} \dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j \ \text{ : Probit } \\ \dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda(\mathbf x \boldsymbol \beta) (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j \ \text{ : Logit } \end{cases}


  • For the probit case, \phi is the density function of the normal distribution and we have

\dfrac {\partial \Phi(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \, \dfrac {\partial}{\partial x_j} \int_{-\infty}^{\mathbf x \boldsymbol \beta} \phi(v)dv \ = \ \phi(\mathbf x \boldsymbol \beta) \, \beta_j

  • For the logit case, applying the Quotient rule, we have

\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \dfrac {\partial}{\partial x_j} \left( \dfrac {e^{\mathbf x \boldsymbol \beta}}{1 + e^{\mathbf x \boldsymbol \beta}} \right) \ = \ \dfrac {e^{\mathbf x \boldsymbol \beta}}{ (1 + e^{\mathbf x \boldsymbol \beta})^2} \, \beta_j

After some manipulations, we eventually get

\dfrac {\partial \Lambda(\mathbf x \boldsymbol \beta)}{\partial x_j} \ = \ \Lambda'(\mathbf x \boldsymbol \beta) \, \beta_j \ = \ \Lambda(\mathbf x \boldsymbol \beta) \, (1 - \Lambda(\mathbf x \boldsymbol \beta) ) \, \beta_j

  • In both cases the partial effect of x_j on the P(y=1 | \mathbf x) is not equal to \beta_j (which is partial effect on the latent variable y^*), but \beta_j must be multiplied by the corresponding density function, evaluated at \, \mathbf x \boldsymbol \beta, i.e., at particular values of \mathbf x

  • Hence, the partial effects are nonlinear and depend on the level of \mathbf x

    • Remark: The formulas for the partial effects just derived are only valid for continuous or nearly continuous variables

  • One can calculate APE (average partial effects) for instance with the R package marginaleffects, or PEA (partial effects at average)

    • The former is the partial effect calculated for each observation \mathbf x_i and then averaged (see the sketch at the end of this subsection). For the Probit model this is: \widehat{A P E}_{j} =\left( \frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i} \hat{\boldsymbol{\beta}}\right) \right) \hat\beta_{j}

    • The latter is calculated at the mean of \mathbf x, \bar {\mathbf x}. For the Logit model this is \, \widehat{P E A_{j}} = \Lambda' (\bar{\mathbf{x}} \hat{\boldsymbol{\beta}}) \hat\beta_{j}

    • PEAs could be somewhat problematic in the case of dummy variables. For instance, if you have a dummy variable for gender in your model, averaging to 0.5 might be questionable

  • For a discrete explanatory variable x_k, e.g., the number of years of education or gender, the APE can directly be computed (for the Logit model replace \Phi by \Lambda)

\widehat{A P E}_{k} \ = \ n^{-1} \sum_{i=1}^n \left( \Phi\left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k}\left(c_{ik}+1\right) \right] \ - \ \Phi \left[\beta_{0}+\beta_{1} x_{i1}+\ldots+\beta_{k} c_{ik} \right] \right)

The R package marginaleffects applies this formula automatically to factor variables
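
As an illustration, the Probit APE formula can be computed by hand for the labor force participation model estimated above (object outprobit) and compared with the output of avg_slopes(); a sketch for the (roughly continuous) variable nwifeinc, assuming the marginaleffects package is installed:

Code
library(marginaleffects)

X  <- model.matrix(outprobit)          # regressor matrix including the intercept
b  <- coef(outprobit)
xb <- as.vector(X %*% b)               # linear index x_i * beta-hat

# APE of nwifeinc: average of phi(x_i * beta-hat), multiplied by the coefficient
mean(dnorm(xb)) * b["nwifeinc"]

# should be (approximately) the same as the result of marginaleffects
avg_slopes(outprobit, variables = "nwifeinc")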


10.1.6 Estimation

  • As both models are highly nonlinear, OLS cannot be applied. In principle, NLS (nonlinear least squares) could be applied, but because of the heteroscedasticity of the error terms this would not be efficient

  • Maximum Likelihood estimation (MLE) is the main procedure

    • Thereby, we are looking for coefficients \boldsymbol \beta so that the observed sample has the highest probability

    • Each observation i is treated as a single and independent draw from a Bernoulli distribution (binomial with one draw)

    • We have the success probability for observation i (remember: G=\Phi for the Probit and G=\Lambda for the Logit model): P(y_i=1 \, | \, \mathbf x_i; \boldsymbol \beta) = G(\mathbf x_i \boldsymbol \beta)

    • and the non-success probability for observation i:
      P(y_i=0 \, | \, \mathbf x_i; \boldsymbol \beta) = 1-G(\mathbf x_i \boldsymbol \beta)

  • This leads to the joint probability function (for all observations)

P(y_1, \ldots, y_n \, | \, \mathbf x; \boldsymbol \beta) \ = \ \underset {y=0} {\prod_{i=1}^n} \left[ 1-G(\mathbf x_i \boldsymbol \beta) \right] \, \underset {y=1} {\prod_{i=1}^n} G(\mathbf x_i \boldsymbol \beta)


  • This can be conveniently rewritten to the likelihood function of the sample for the n observations

L(\boldsymbol \beta \ | \ \text{data}) \ = \ \prod_{i=1}^n \underbrace {\left[ G(\mathbf x_i \boldsymbol \beta) \right]^{y_i} \, \left[1-G(\mathbf x_i \boldsymbol \beta) \right]^{1-y_i}}_{= f(y_i | \mathbf x, \boldsymbol \beta), \, \text{ conditional density of } y_i}

  • (Note, if y_i=0 then 1-y_i=1 and if y_i=1 then 1-y_i=0)

  • Taking logs we get the log-likelihood function

\ln L(\boldsymbol \beta \ | \ \text{data}) \ = \ \sum_{i=1}^n \underbrace {\left\lbrace y_i \ln \left[ G(\mathbf x_i \boldsymbol \beta) \right] \, + \, (1-y_i) \ln \left[1-G(\mathbf x_i \boldsymbol \beta) \right] \right \rbrace}_{\ln f(y_i | \mathbf x, \boldsymbol \beta)}

  • We maximize \ln L(\boldsymbol \beta \, | \, \text{data}) with respect to \boldsymbol \beta such that the observed sample has the highest probability. This leads to the likelihood equations (the counterparts to the normal equations of OLS)

\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n \left[ y_i \dfrac{g(\mathbf x_i \boldsymbol \beta)}{G(\mathbf x_i \boldsymbol \beta)} \ + \ (1-y_i) \dfrac{-g(\mathbf x_i \boldsymbol \beta)}{1-G(\mathbf x_i \boldsymbol \beta)} \right] \mathbf x_i \ = \ \mathbf 0, \quad \text{ with } g := G'


  • After some manipulations, for the Logit case, these likelihood equations reduce to

\dfrac { \partial \ln L}{\partial \boldsymbol \beta} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \mathbf x_i \ = \ \mathbf 0 \

  • Noting that [y_i - \Lambda(\mathbf x_i \boldsymbol \beta)] are the residuals of the model, the similarities to the OLS normal equations are obvious (compare Equation 1.9)

  • If \boldsymbol \beta contains an intercept, the first likelihood equation is

    \dfrac { \partial \ln L}{\partial \beta_0} \ = \ \sum_{i=1}^n [ y_i - \Lambda(\mathbf x_i \boldsymbol \beta) ] \ = \ 0 \ \ \ \Rightarrow \ \ \ \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n \Lambda(\mathbf x_i \boldsymbol \beta)

    • Hence, the portion of ones in the sample equals the average of the predicted probabilities
  • Interestingly, the likelihood equations for the Probit case don’t look so familiar, involving \phi/\Phi, the inverse Mills ratio

  • In both cases, the likelihood equations are highly nonlinear and have to be solved numerically, for instance with the Newton-Raphson algorithm
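
To make the estimation procedure concrete, here is a minimal sketch that maximizes the Logit log-likelihood numerically with optim() and compares the result with glm(), reusing the labor force participation specification from above:

Code
library(wooldridge)
data("mroz")

X <- model.matrix(~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)
y <- mroz$inlf

# negative log-likelihood of the Logit model (optim() minimizes by default)
negloglik <- function(beta, y, X) {
  xb <- X %*% beta
  # log(G) and log(1-G), computed in a numerically stable way
  -sum(y * plogis(xb, log.p = TRUE) + (1 - y) * plogis(xb, lower.tail = FALSE, log.p = TRUE))
}

opt <- optim(rep(0, ncol(X)), negloglik, y = y, X = X, method = "BFGS", hessian = TRUE)

# compare the hand-coded ML estimates with glm()
cbind(optim = opt$par,
      glm = coef(glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                     data = mroz, family = binomial(link = "logit"))))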


10.1.7 Example

  • We once more examine a model for the probability of labor force participation of married women, inlf (for “in labor force”)

  • The explanatory variables are: nwifeinc (additional family income), educ, exper, exper^2, age, kidslt6 (number of children under 6 years) and kidsge6 (number of children over 6 years)

  • This time, we additionally estimate Logit and Probit models. Furthermore, we use the R package marginaleffects to calculate the implied APE

  • The results presented below are pretty similar for the three models

    • Note, you should only directly compare the APE

    • As shown previously, the coefficients of Logit and Probit models are non-linear transformations of the partial effects

      • The Logit coefficients are typically 1.6 to 1.8 times larger than the Probit coefficients
      • Probit coefficients are typically about 3.3 times larger than the ones of the LPM
    • Fortunately, the R-package marginaleffects is doing the work for us and calculates the appropriate APE

  • All estimated coefficients have the expected sign

    • Once more, of special interest is the variable kidslt6, the number of children under 6 years. The estimated APE is about -0.26, meaning that an additional young child reduces the probability of being in the labor force by about 26 percentage points (on average)
  • But for Logit and Probit models, this partial effect depends in a nonlinear manner on the values of the explanatory variables. This is a main advantage compared to the LPM. The following figures show these nonlinear effects of the number of children under 6 on labor participation


library(wooldridge) 
library(marginaleffects)
library(modelsummary)

data("mroz")

# LPM model (glm() with the default gaussian family yields the OLS estimates)
outlpm <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

#Logit model
outlogit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, 
                family = binomial(link = "logit"))

# Probit model
outprobit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, 
                 family = binomial(link = "probit"))

# Average partial effects with avg_slopes() from marginaleffects package
marg_lpm <- avg_slopes(outlpm)
marg_logit <- avg_slopes(outlogit)
marg_probit <- avg_slopes(outprobit)

summary(outlogit)
      
      Call:
      glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age + 
          kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
      
      Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
      (Intercept)  0.425452   0.860365   0.495  0.62095    
      nwifeinc    -0.021345   0.008421  -2.535  0.01126 *  
      educ         0.221170   0.043439   5.091 3.55e-07 ***
      exper        0.205870   0.032057   6.422 1.34e-10 ***
      I(exper^2)  -0.003154   0.001016  -3.104  0.00191 ** 
      age         -0.088024   0.014573  -6.040 1.54e-09 ***
      kidslt6     -1.443354   0.203583  -7.090 1.34e-12 ***
      kidsge6      0.060112   0.074789   0.804  0.42154    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 1029.75  on 752  degrees of freedom
      Residual deviance:  803.53  on 745  degrees of freedom
      AIC: 819.53
      
      Number of Fisher Scoring iterations: 4
summary(outprobit)
      
      Call:
      glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age + 
          kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
      
      Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
      (Intercept)  0.2700736  0.5080782   0.532  0.59503    
      nwifeinc    -0.0120236  0.0049392  -2.434  0.01492 *  
      educ         0.1309040  0.0253987   5.154 2.55e-07 ***
      exper        0.1233472  0.0187587   6.575 4.85e-11 ***
      I(exper^2)  -0.0018871  0.0005999  -3.145  0.00166 ** 
      age         -0.0528524  0.0084624  -6.246 4.22e-10 ***
      kidslt6     -0.8683247  0.1183773  -7.335 2.21e-13 ***
      kidsge6      0.0360056  0.0440303   0.818  0.41350    
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 1029.7  on 752  degrees of freedom
      Residual deviance:  802.6  on 745  degrees of freedom
      AIC: 818.6
      
      Number of Fisher Scoring iterations: 4
Code
# Modifications for modelsummary to print hit rates for glm() models
library(broom)
# the additional goodness-of-fit statistics are listed one below the other
glance_custom.glm <- function(x, ...) {
  HRate <- round( hitrate(x, stragaz=FALSE)$HitRate, 2 )
  Sens <-  round( hitrate(x, stragaz=FALSE)$Sensitivity, 2 )
  Spec <-  round( hitrate(x, stragaz=FALSE)$Specificity, 2 )
  Ratio <- round( hitrate(x, stragaz=FALSE)$Nof1s / hitrate(x, stragaz=FALSE)$Nof0s, 2 ) 
  out <- data.frame("HitRate" = HRate, 
                    "Sensitifity" = Sens,
                    "Specificity" = Spec,
                    "Ratio 1/0" = Ratio)
  return(out)
}
Code
modelsummary( list(" LPM "=outlpm, " LPM APE "=marg_lpm, " Logit "=outlogit, 
                   " Logit APE "=marg_logit, " Probit "=outprobit, 
                   " Probit APE "=marg_probit), 
              output="flextable", 
              gof_omit = "F", 
              align = "ldddddd", 
              stars = TRUE, 
              fmt = 4)
Table 10.1: Probability of labor force participation of married women, several different models

====================================================================================
              LPM        LPM APE    Logit      Logit APE  Probit     Probit APE
------------------------------------------------------------------------------------
(Intercept)    0.5855***             0.4255                0.2701   
              (0.1542)              (0.8604)              (0.5081)  
nwifeinc      -0.0034*   -0.0034*   -0.0213*   -0.0038*   -0.0120*   -0.0036*  
              (0.0014)   (0.0014)   (0.0084)   (0.0015)   (0.0049)   (0.0015)  
educ           0.0380***  0.0380***  0.2212***  0.0395***  0.1309***  0.0394***
              (0.0074)   (0.0074)   (0.0434)   (0.0073)   (0.0254)   (0.0073)  
exper          0.0395***  0.0268***  0.2059***  0.0254***  0.1233***  0.0256***
              (0.0057)   (0.0025)   (0.0321)   (0.0022)   (0.0188)   (0.0022)  
I(exper^2)    -0.0006**             -0.0032**             -0.0019** 
              (0.0002)              (0.0010)              (0.0006)  
age           -0.0161*** -0.0161*** -0.0880*** -0.0157*** -0.0529*** -0.0159***
              (0.0025)   (0.0025)   (0.0146)   (0.0024)   (0.0085)   (0.0024)  
kidslt6       -0.2618*** -0.2618*** -1.4434*** -0.2578*** -0.8683*** -0.2612***
              (0.0335)   (0.0335)   (0.2036)   (0.0319)   (0.1184)   (0.0319)  
kidsge6        0.0130     0.0130     0.0601     0.0107     0.0360     0.0108   
              (0.0132)   (0.0132)   (0.0748)   (0.0133)   (0.0440)   (0.0132)  
------------------------------------------------------------------------------------
Num.Obs.       753        753        753        753        753        753      
R2             0.264      0.264                                                
AIC            865.8      865.8      819.5      819.5      818.6      818.6    
BIC            907.4      907.4      856.5      856.5      855.6      855.6    
Log.Lik.      -423.892   -423.892   -401.765   -401.765   -401.302   -401.302  
RMSE           0.42       0.42       0.42       0.42       0.42       0.42     
HitRate        0.73       0.73       0.74       0.74       0.73       0.73     
Sensitivity    0.82       0.82       0.81       0.81       0.81       0.81     
Specificity    0.62       0.62       0.64       0.64       0.63       0.63     
Ratio.1.0      1.32       1.32       1.32       1.32       1.32       1.32     
------------------------------------------------------------------------------------
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


Code
visreg(outlpm, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - LPM")


Code
visreg(outprobit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - Probit")


Code
visreg(outlogit, "kidslt6", scale = "response", band = FALSE, ylim=c(-0.05, 1.05), 
       main = "Effect of number of children under 6 on LF-participation - Logit")


10.2 Maximum likelihood estimation in general

The starting point for obtaining maximum likelihood estimators (MLE) is the assumption that the conditional distribution of the data – the data generating process – is known, except for a finite number of parameters \boldsymbol \theta which have to be estimated

  • So, we have the probability density function (if y_i is a continuous RV), respectively the probability mass function (if y_i is a discrete RV), for observation i

f(y_i \, | \, \mathbf x_i; \boldsymbol \theta), \ \ \forall i

  • The joint density of n independent and (identically) distributed (i.i.d.) observations for this process is the product of the individual densities

f(y_1, \ldots, y_n \, | \, \mathbf X; \boldsymbol \theta) \ = \ \prod_{i=1}^n f(y_i \, | \, \mathbf x_i; \boldsymbol \theta) \ = \ L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X)

  • The joint density also defines the Likelihood function L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X), which is a function of the unknown parameters \boldsymbol \theta, given the sample data (\mathbf y, \mathbf X)

    • Note the reversed writing of the variables. This should simply indicate that we are interested in finding the parameters given the data (not that \boldsymbol \theta is random)
  • Remark: If the observations are not independent, as is very often the case with time series, the joint density is more complicated and involves the covariance matrix of the observations y_i. For more complicated time series models, the joint density can in many cases be obtained with the so-called prediction-error decomposition of the likelihood, which is calculated with the help of the Kalman filter


  • As it is usually simpler to work with logs, the log-likelihood function is

\ln L(\boldsymbol \theta \ | \ \mathbf y, \mathbf X) \ = \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)

  • We apply the Maximum Likelihood principle: In estimating \boldsymbol \theta, we are looking for parameters \boldsymbol \theta so that the observed sample has the highest probability

\underset{\boldsymbol \theta}{\max} \ \ln L(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \ln f(y_i \, | \, \mathbf x_i, \boldsymbol \theta)

  • The first order conditions of this optimization problem lead to the so called likelihood equations, which are the counterparts to the normal equations of OLS

\dfrac { \partial \ln L}{\partial \boldsymbol \theta} := \ \mathbf g(\boldsymbol \theta) \ = \ \sum \nolimits_{i=1}^n \frac {\partial }{\partial \boldsymbol \theta} \, \ln f(y_i | \mathbf x_i, \boldsymbol \theta) \ = \ \mathbf 0

  • The solution of the likelihood equations leads to the estimated parameter vector \hat{\boldsymbol \theta}_{ML}

    • This estimate is identified (estimable) if for any other \boldsymbol \theta \neq \hat{\boldsymbol \theta}_{ML}, we have L(\hat{\boldsymbol \theta}_{ML}) \neq L(\boldsymbol \theta) (uniqueness)

    • In most cases, the optimization has to rely on numerical optimization procedures like the Newton-Raphson algorithm

  • The second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum


10.2.1 Properties of MLE

If the distributional assumption holds, Maximum Likelihood estimators are generally

  • Consistent

    • \operatorname{plim} \hat{\boldsymbol \theta}_{ML} = \boldsymbol \theta_T, \quad with \boldsymbol \theta_T being the true parameter vector

    • But in finite samples, MLE are not generally unbiased
      \text{ }

  • Asymptotically normal with

    \hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace^{-1} \right)

    • Thereby, \mathbf H(\boldsymbol \theta_T) is the Hessian Matrix of the log-likelihood function, evaluated at the true parameter vector \boldsymbol \theta_T:

\operatorname {Asy.Var} (\hat {\boldsymbol \theta}_{ML}) \, = \, \ \left( - E \left[ \mathbf H(\boldsymbol \theta_T) \right] \right) ^{-1} \, = \, \left( -E \left[ \dfrac {\partial^2 \ln L}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right] \right)^{-1} \, = \,

\left( -E \, \left[ \begin{array}{ccc} \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_0 \partial \theta_k} \\ \vdots & \ddots & \vdots \\ \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_0} & \cdots & \dfrac {\partial^2 \ln L }{\partial \theta_k \partial \theta_k} \end{array} \right] \right)^{-1}

\text{ }

  • Asymptotically efficient in the class of consistent and asymptotically normal estimators

    • The Cramer-Rao Lower Bound is the asymptotic lower bound of the variance of a consistent and asymptotically normal estimator. And this can be shown to be equal to the inverse of the Information Matrix

      • The Information matrix \mathbf I_n(\boldsymbol \theta_T) for \boldsymbol \theta_T is defined as the covariance matrix of the gradient \mathbf g, the score, of the log-likelihood function evaluated at the true parameter vector. \mathbf I_n(\boldsymbol \theta_T) measures the amount of information about \boldsymbol \theta_T contained (on average) in a realization of the data

      \mathbf I_n (\boldsymbol \theta_T) := \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = E \left( \mathbf g(\boldsymbol \theta_T) \, \mathbf g(\boldsymbol \theta_T)' \right),

      \text{with } \ \mathbf g(\boldsymbol \theta) := \dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y), \ \ E[\mathbf g(\boldsymbol \theta_T)] = \mathbf 0

      • Under fairly general conditions, the Information Matrix is equal to the negative expected value of the Hessian Matrix, evaluated at the true parameters. This is the Information matrix equality

      \mathbf I_n (\boldsymbol \theta_T) = \operatorname {Var} \left( \mathbf g(\boldsymbol \theta_T) \right) = -E[\mathbf H(\boldsymbol \theta_T)]

      • The right hand term equals the inverse of the covariance matrix of \hat {\boldsymbol \theta}_{ML}. Hence, the variance of MLE is equal to Cramer-Rao Lower Bound and therefore, the MLE is asymptotically efficient
  • Finite sample property: If the sample is drawn from an exponential family distribution and if there exists an unbiased minimum variance estimator, then it is a MLE


10.2.2 Covariance matrix

  • Sketch of a proof that the MLE is asymptotically normal. We make a first order approximation of the first order conditions and assume that \mathbf y is i.d. (to ease the notation we write \hat{\boldsymbol \theta} instead of \hat{\boldsymbol \theta}_{ML})

\mathbf g (\hat {\boldsymbol \theta}) = \mathbf 0, \ \ \Rightarrow \ \ \mathbf g ( \hat {\boldsymbol \theta}) \approx \mathbf g (\boldsymbol \theta_T) + \underbrace{\mathbf H(\boldsymbol \theta_T)}_{\frac {\partial }{\partial \boldsymbol \theta} \mathbf g(\boldsymbol \theta_T)} (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \mathbf 0 \ \ \Rightarrow

\mathbf g (\boldsymbol \theta_T) = -\mathbf H(\boldsymbol \theta_T)(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \ \Rightarrow

(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [-\mathbf H(\boldsymbol \theta_T)]^{-1} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow

(\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = [- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1} \frac{1}{n} \mathbf g (\boldsymbol \theta_T) \ \ \Rightarrow

\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) = \underbrace {[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]^{-1}}_{\stackrel{p}{\longrightarrow} \ \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1}} \ \ \ \ \cdot \underbrace{\sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T)}_{\stackrel{d}{\longrightarrow} \ N(\mathbf 0,\, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)])} \ \ \Rightarrow

\sqrt n (\hat {\boldsymbol \theta}-\boldsymbol \theta_T) \ \stackrel{d}{\longrightarrow} \ N \left( \mathbf 0, \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \, E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)] \, \lbrace E[- \frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1} \right) \ =

N (\mathbf 0, \, \lbrace E[-\frac{1}{n} \mathbf H(\boldsymbol \theta_T)]\rbrace^{-1})


  • For the last two we invoke the i.d. assumption and the Information matrix equality. In this case

\mathbf g (\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta), \ \ \text{ with } \mathbf g_i(\boldsymbol \theta) := \dfrac {\partial \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta}, \ \ \text{ and } \ \ \mathbf {\bar g} (\boldsymbol \theta) = \frac {1}{n} \sum\nolimits_{i=1}^n \mathbf g_i(\boldsymbol \theta)

\mathbf H(\boldsymbol \theta) = \sum\nolimits_{i=1}^n \mathbf H_i (\boldsymbol \theta), \ \ \text{ with } \mathbf H_i(\boldsymbol \theta) := \dfrac {\partial^2 \ln f(y_i|\mathbf x, \boldsymbol \theta)} {\partial \boldsymbol \theta \, \partial \boldsymbol \theta'}

  • \sqrt n\ \bar{\mathbf g} (\boldsymbol \theta_T) is \sqrt n times the average of random variables \mathbf g_i with an expected value of 0 and covariance matrix \operatorname {Var}\left( \mathbf g_i (\boldsymbol \theta_T) \right) = -E[\mathbf H_i (\boldsymbol \theta_T)] (by the Information matrix equality) and an average covariance matrix of -E[\frac{1}{n} \sum_i \mathbf H_i (\boldsymbol \theta_T)] = -E[\frac {1}{n} \mathbf H (\boldsymbol \theta_T) ]. Hence, we can invoke a CLT

  • -\frac{1}{n} \mathbf H(\boldsymbol \theta_T) is an average too and, by LLN, converges in probability to its expected value -E[\frac {1}{n} \mathbf H(\boldsymbol \theta_T)]

  • From the limiting distribution of \sqrt n (\hat {\boldsymbol \theta}_{ML} - \boldsymbol \theta_T) we get the asymptotic distribution of \hat{\boldsymbol \theta}_{ML}

\hat {\boldsymbol \theta}_{ML} \ \, \stackrel{a}{\sim} \ \, N \left( \boldsymbol \theta_T ,\, \lbrace -E[ \mathbf H(\boldsymbol \theta_T)] \rbrace ^{-1} \right)

  • This basically implies consistency of \hat {\boldsymbol \theta}_{ML} too. The essential point was E[\mathbf g(\boldsymbol \theta_{T})] = \mathbf 0 in the third-to-last line of the proof

As the expected value in the covariance matrix formula is generally difficult to obtain, the covariance matrix of the estimated parameters is simply estimated by the inverse of the actual (not the expected) negative Hessian matrix at \hat{\boldsymbol \theta}_{ML} (a computational sketch for the Logit model follows at the end of this subsection)

\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ -\mathbf H(\hat {\boldsymbol \theta})^{-1} \ =

\left( -\dfrac {\partial^2 \ln L(\hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( - \dfrac { \sum_{i=1}^n \partial^2 \ln f(y_i | \mathbf x; \hat{\boldsymbol \theta})}{\partial \boldsymbol \theta \, \partial \boldsymbol \theta'} \right)^{-1} \ = \ \left( -\sum_{i=1}^n \mathbf H_i(\hat {\boldsymbol \theta}) \right)^{-1}

with f(y_i | \mathbf x; \boldsymbol \theta) being the conditional distribution function of observation i

  • Another often used estimator for the covariance matrix involves the outer product of the gradient (score) vectors. This is based on the fact shown above that the Information matrix is defined as the variance of the scores, which is the expected value of the outer product of the gradient vectors

\widehat{\operatorname {Var} (\hat{\boldsymbol \theta})} \ = \ \left( \sum_{i=1}^n \mathbf g_i(\hat{\boldsymbol \theta}) \, \mathbf g_i(\hat{\boldsymbol \theta})' \right)^{-1}

  • This is the so-called BHHH estimator and is comparatively easy to obtain. This estimate is always positive definite, but in small samples the estimator using the Hessian matrix generally performs better

  • Why does the variance of the estimated parameters depend on the second derivatives?

    • The second derivative measures how concave the function is. If this is (in absolute terms) small at the optimum (nearly linear, with a slope of 0), the likelihood function is flat at the optimum. This means that there is a large interval of other parameters having nearly the same likelihood. Hence, the variance of the estimated \theta is high

    • On the other hand, if the second derivative is large in absolute terms, the likelihood function is very curved at the optimum, meaning that there is only a small interval of parameters having nearly the same likelihood. Therefore, the estimate is precise and the variance of \hat \theta is low

  • Hence, we have an inverse relationship between the size of the second derivatives of the likelihood function and the variance of the estimated parameters; and this inverse relationship is reflected in the covariance matrix formula

  • As we already know, the second order condition for maximization of the likelihood function requires that the Hessian matrix is negative definite, so that the likelihood function is concave at the optimum

    • For the Logit model we have: \mathbf H = - \sum_i \Lambda(\mathbf x_i \boldsymbol \beta) (1-\Lambda(\mathbf x_i \boldsymbol \beta) )\mathbf x_i \mathbf x_i', which is clearly negative definite for all \boldsymbol \beta

    • A similar result applies to the Probit model, which means that the likelihood functions of the two models are always concave, guaranteeing a global maximum
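
A computational sketch of these two covariance estimates for the Logit model of Section 10.1.7 (object outlogit), using the analytic Hessian and score formulas for the Logit case given above:

Code
X <- model.matrix(outlogit)
y <- outlogit$y                          # 0/1 dependent variable stored by glm()
b <- coef(outlogit)
lam <- as.vector(plogis(X %*% b))        # Lambda(x_i * beta-hat)

# (1) Hessian-based estimate: -H = sum_i Lambda_i (1 - Lambda_i) x_i x_i'
V_H <- solve(t(X) %*% (X * (lam * (1 - lam))))

# (2) BHHH estimate: outer product of the scores g_i = (y_i - Lambda_i) x_i
G <- X * (y - lam)
V_BHHH <- solve(crossprod(G))

# compare the implied standard errors with those reported by glm()
round(cbind(glm = sqrt(diag(vcov(outlogit))),
            Hessian = sqrt(diag(V_H)),
            BHHH = sqrt(diag(V_BHHH))), 4)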


10.2.3 Hypothesis testing

  • Because we have an estimate of the covariance matrix of \hat {\boldsymbol \theta}, we can carry out all of the usual Wald tests of linear hypotheses (after linearization, nonlinear restrictions are also possible – the restriction matrix \mathbf R is replaced by the Jacobian matrix of the nonlinear restrictions)

    • With this test procedure, only the unrestricted model has to be estimated. If the restriction is not compatible with the data, i.e., if \mathbf R \hat{\boldsymbol \theta} is too far away from the hypothesized value, the restriction is rejected

    • The big advantage of this test strategy is that with a large enough sample no assumption about the distribution is required (we do not need the likelihood function)


  • In the context of MLE, we have an estimate of the log-likelihood of the observed sample. This can be exploited for a Likelihood Ratio (LR) test

    • With this procedure, the restricted as well as the unrestricted model is estimated. We are then looking at the log-ratio of the likelihood of the two models. If the difference of the log-likelihood of the two models is too large, the restricted model can be rejected. The test statistic is asymptotically \chi^2-distributed with m degrees of freedom, m being the number of restrictions

LR \ = \ 2 \left( \ln L(\hat{\boldsymbol \theta}_{ur}) - \ln L(\hat{\boldsymbol \theta}_r) \right) \ \stackrel{a}{\sim} \ \chi^2(m)
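
For instance, a likelihood ratio test of the joint hypothesis that the two children variables can be excluded from the Logit model of Section 10.1.7 can be carried out by hand (a sketch using base R only; m = 2 restrictions):

Code
# restricted model: kidslt6 and kidsge6 excluded
outlogit_r <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age,
                  data = mroz, family = binomial(link = "logit"))

LR <- as.numeric(2 * (logLik(outlogit) - logLik(outlogit_r)))
LR                                       # LR test statistic
pchisq(LR, df = 2, lower.tail = FALSE)   # asymptotic chi^2(2) p-value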


  • The third test concept, which is also derived from the Maximum Likelihood principle, is the Lagrange Multiplier test (LM)

    • With this procedure, only the restricted model is estimated. To obtain the parameter of the restricted model the log-likelihood function is maximized with respect to m constraints \mathbf c(\boldsymbol \theta)=\mathbf 0, using the Lagrange function

    \underset{\boldsymbol \theta}{\max}: \ \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) - \boldsymbol \lambda' \mathbf c(\boldsymbol \theta)

    • The first order condition shows that a restriction is only binding (has an influence on the estimated parameters) if the corresponding element of the (m \, \text{x} \, 1) Lagrange multiplier vector \boldsymbol \lambda is different from zero

    \underbrace {\dfrac {\partial }{\partial \boldsymbol \theta} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X)}_{\mathbf g(\boldsymbol \theta)} - \boldsymbol \lambda' \dfrac {\partial }{\partial \boldsymbol \theta} \mathbf c(\boldsymbol \theta) = \mathbf 0

    • If the restriction is of the form \theta_j = c_j, with c_j some constants, the corresponding derivative of the constraint is one, \dfrac {\partial }{\partial \theta_j} \mathbf c(\boldsymbol \theta) = 1 and then, from the first order conditions, we have

    \dfrac {\partial }{\partial \theta_j} \ln L(\boldsymbol \theta; \mathbf y, \mathbf X) = \lambda_{\tilde j}

    • Thus, the vector \boldsymbol \lambda, which is often interpreted as shadow prices for the restrictions \theta_j = c_j, is equal to the corresponding derivatives of the log-likelihood function. Therefore, \lambda_{\tilde j} measures the extent to which the first order conditions of the unconstrained model are violated (without restriction, the corresponding derivative should differ from \mathbf 0 only because of sampling errors)

    • Hence, one idea is to test whether \boldsymbol \lambda is zero. A similar and more common test is directly based on the gradient vectors \mathbf g(\boldsymbol \theta_r), the score vectors, instead of \boldsymbol \lambda; hence, the label score test. This as well measures to what extent the gradient differs from \mathbf 0, the expected value of \mathbf g(\boldsymbol \theta) in the unconstrained optimum

    • It turns out that \mathbf g(\boldsymbol \theta) is asymptotically normally distributed and we have a “Wald-like” LM test statistic

    LM \ = \ \mathbf g(\boldsymbol \theta_r)' \, \mathbf I(\boldsymbol \theta_r)^{-1} \mathbf g(\boldsymbol \theta_r) \ \ \stackrel{a}{\sim} \ \ \chi^2(m)

    with \mathbf I(\boldsymbol \theta_r) being the Information matrix, evaluated at the restricted parameter vector


  • An asymptotically equivalent procedure to the score test is to first run a restricted regression and then, in a second step, an auxiliary regression using the residuals of the first step to test for a relaxation of the restrictions

    • Typically, the test statistic for this variant of the LM test is \, n \cdot R^2, which is \chi^2(m) distributed (a sketch follows at the end of this subsection)

    • This procedure is especially suited for testing misspecifications, as in this case the restricted variant is typically easier to estimate than the unrestricted model. Examples are tests for

      • Heteroskedasticity (Breusch-Pagan test)

      • Serial correlation (Breusch-Godfrey test)

      • General misspecifications (RESET)

      • Endogenous regressors (Hausman-Wu test)

  • Tests for exclusions of variables or linear restrictions among variables are typically easier to carry out with a Wald test

  • All three test procedures introduced – Wald tests, LR tests and LM tests – are only suitable for nested models and are asymptotically equivalent

    • The small sample properties of LR and LM tests are generally unknown, and the LM test is said to have large small-sample distortions. On the other hand, the LM test is known to have greater power against small deviations from the H_0
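
As an illustration of the auxiliary-regression variant, the Breusch-Pagan test can be computed by hand in its n \cdot R^2 form; a sketch for the linear probability model outm of Section 10.1.1, whose errors are heteroskedastic by construction:

Code
# LM (Breusch-Pagan) test by hand: regress the squared residuals on the regressors
ressq_lpm <- residuals(outm)^2
aux <- lm(ressq_lpm ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)

LM <- length(ressq_lpm) * summary(aux)$r.squared   # n * R^2 of the auxiliary regression
LM
pchisq(LM, df = 7, lower.tail = FALSE)             # m = 7 restrictions (all slopes zero)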

Three test procedures

Code
library(magick)
img = image_read("topic3_tests.jpg")
image_scale(img, "600")

Figure 10.3: Principles of the three different test procedures, source Green (2017)

10.2.4 Measures of fit

  • We already dealt with the hit rate in the context of binary choice models, with the refinements of sensitivity and specificity

  • Pseudo R-squared: Compare the maximized log-likelihood of the model with that of a model that only contains a constant (and no explanatory variables); McFadden’s \tilde{R}^2 (a computational sketch follows at the end of this subsection)

\tilde{R}^{2}=1- \dfrac {\ln L_{u r}} {\ln L_{0}}

  • Correlation based measures: Look at correlation (or squared correlation) between predicted values or predicted probabilities versus observed values

    \operatorname {Corr} (y_i, \hat y_i)

    • In the context of continuous dependent variables models, this corresponds to the usual R^2
  • Information criteria:

    • Akaike information criterion, AIC = -2 \ln L + 2k

    • Bayesian (Schwarz) information criterion BIC = -2 \ln L + k \ln n

      • Both are a measure of the information loss due to an imperfect model. They deal with the trade-off between the goodness of fit and the simplicity of the model, thereby considering both the risk of overfitting and underfitting

      • They are based on the log-likelihood of the model but place a penalty on the number of estimated parameters, similar to the adjusted R-squared

      • These criteria can also be used to compare non-nested models (as long as y is the same): the lower, the better – this strategy is said to lead to better out-of-sample predictions
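
A short computational sketch of these measures for the Logit model of Section 10.1.7 (object outlogit):

Code
# McFadden's pseudo R-squared: compare with an intercept-only model
logit0 <- glm(inlf ~ 1, data = mroz, family = binomial(link = "logit"))
1 - as.numeric(logLik(outlogit)) / as.numeric(logLik(logit0))

# squared correlation between observed values and predicted probabilities
cor(mroz$inlf, fitted(outlogit))^2

# information criteria
AIC(outlogit)
BIC(outlogit)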


10.2.5 Digression: Introduction to Bayesian Statistics

For an introduction to Bayesian statistics, watch this video:

https://www.youtube.com/watch?v=Pahyv9i_X2k

As the video above shows, the basic idea of Bayesian statistics is to update prior information about a parameter vector \boldsymbol \theta with observed data. Bayes Theorem implies that the posterior distribution of the parameter vector \boldsymbol \theta, f(\boldsymbol \theta|data), is proportional to the likelihood function f(data|\boldsymbol \theta) times the prior distribution f(\boldsymbol \theta). An important economic application is estimation of DSGE models.

However, the posterior distribution is often very complicated and difficult or impossible to integrate. For instance, we need to integrate to get the expectation of the posterior E(\boldsymbol \theta|data) = \int_\theta \boldsymbol \theta f(\boldsymbol \theta|data) d\boldsymbol \theta.

A solution to this problem is provided by sampling algorithms, which draw samples of \boldsymbol \theta from a proposal distribution in such a way that we finally get draws that are asymptotically distributed like the posterior distribution. If we have such a large sample of \boldsymbol \theta from the posterior, it is very easy to get the expectation of \boldsymbol \theta (it is simply the mean) or credible intervals (the Bayesian counterparts to confidence intervals).

Markov Chain Monte Carlo (MCMC) methods, and in particular the Metropolis-Hastings (MH) algorithm, are such powerful sampling algorithms. By the way, these algorithms are also very important in many other fields like climate modeling, physics or big data econometrics. To learn more, watch the following video:

https://www.youtube.com/watch?v=U561HGMWjcw
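
For a flavor of how such a sampler works, here is a minimal random-walk Metropolis-Hastings sketch for a toy problem (estimating the mean of normally distributed data with known variance and a diffuse normal prior; all values are chosen purely for illustration):

Code
set.seed(1)
ydat <- rnorm(100, mean = 2, sd = 1)                   # toy data

# log posterior (up to a constant): N(theta, 1) likelihood times N(0, 10^2) prior
logpost <- function(theta) {
  sum(dnorm(ydat, mean = theta, sd = 1, log = TRUE)) + dnorm(theta, 0, 10, log = TRUE)
}

n_draws <- 5000
draws <- numeric(n_draws)
theta <- 0                                             # starting value
for (s in 1:n_draws) {
  prop <- theta + rnorm(1, sd = 0.5)                   # random walk proposal (symmetric)
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < logpost(prop) - logpost(theta)) theta <- prop
  draws[s] <- theta
}

keep <- draws[-(1:1000)]                               # discard burn-in draws
mean(keep)                                             # posterior mean, close to mean(ydat)
quantile(keep, c(0.025, 0.975))                        # 95% credible interval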


10.3 Count data

Count data arise when the dependent variable is a counting number per unit of time or space – e.g., the number of restaurant meals taken by a consumer during a week or the number of arrests of a person during a year. So, y_i \in \lbrace 0, 1, 2, \ldots \rbrace

  • If these numbers are large and widespread, the variable can be considered as continuous and hence, OLS would be appropriate

  • If these numbers are small, quite rare events, then the continuous approximation is a poor one

    • In this case, OLS can still yield consistent estimates and the predicted values are interpreted as conditional expected values of the dependent variable

    • However, OLS does not account for the special structure of the data, leading to inefficiency and nonsense predictions, like -0.2 restaurant meals per week

  • One remedy to avoid negative predictions is to postulate

E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

  • and apply NLS. However, this would not be efficient as the error term is heteroscedastic

  • Therefore, maximum likelihood estimation is usually employed for this type of models. As count data are often Poisson distributed this leads to the Poisson regression model (or to the more general negative binomial regression model)


  • The Poisson distribution is derived from a limit of the binomial distribution, which we already encountered with the Logit and Probit models. If the number of independent (Bernoulli) draws goes to infinity and the success probability p goes to zero, so that n\cdot p \rightarrow \lambda, the binomial distribution converges to

    P(Y = y_i) = \dfrac {\lambda^{y_i}}{y_i!} e^{-\lambda}

    And this is called Poisson distribution, with y_i the number of counts

  • The parameter \lambda is equal to both the expectation and the variance of counts! This distribution is considerably right-skewed for a small \lambda, but for \lambda>30, it’s almost identical to a N(\lambda, \lambda) normal distribution

  • As we are interested in the effects of the explanatory variables on y (the number of counts), we must look at the Poisson distribution conditional on \mathbf x. We only have to specify \lambda = E[y \, | \, \mathbf x], which we set to

    \lambda = E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

    as above – \lambda has to be positive!


Given a sample of n observations, we can construct the log-likelihood function, which is the joint log-probability that each y_i is observed

\ln L(\boldsymbol \beta) \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \ln \lambda_i - \lambda_i - \ln(y_i!) \right] \, = \, \sum\nolimits_{i=1}^{n} \left[ y_i \, {\mathbf x_i \boldsymbol \beta} - e^{\mathbf x_i \boldsymbol \beta} - \ln(y_i!) \right]

Note that the last term, \ln(y_i!), is independent of \boldsymbol \beta and could be omitted
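
The formula can be checked against the log-likelihood reported by glm(); a small sketch with simulated count data (values chosen purely for illustration):

Code
set.seed(7)
x <- rnorm(200)
ycount <- rpois(200, lambda = exp(0.5 + 0.8 * x))      # toy Poisson data

pois_fit <- glm(ycount ~ x, family = poisson)
b  <- coef(pois_fit)
xb <- cbind(1, x) %*% b

sum(ycount * xb - exp(xb) - lfactorial(ycount))        # hand-coded log-likelihood
logLik(pois_fit)                                       # should give the same value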

  • The estimates \hat{\boldsymbol \beta} are obtained by numerical maximization of the log-likelihood function

  • As we have

    E[y \, | \, \mathbf x] = e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

    the partial effects are

    \dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} = {\underbrace {e^{(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}_{E[y \, | \, \mathbf x]}} \, \beta_j \ \ \Rightarrow \tag{10.2}

    \dfrac {\partial E[y \, | \, \mathbf x]}{\partial x_j} \dfrac {1}{E[y \, | \, \mathbf x]} \ = \ \dfrac {\partial \ln E[y \, | \, \mathbf x]}{\partial x_j} \ = \ \beta_j \tag{10.3}

    This is a semi-elasticity: a change in x_j of 0.01 leads to (approximately) a \beta_j percent change in E[y \, | \, \mathbf x]. For larger variations in x_j (e.g., for a dummy variable) use the exact formula (e^{\beta_j \Delta x_j}-1) \cdot 100 for the percentage change


  • The Poisson regression model assumes that \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) which is very restrictive and often violated in applied work

  • If \operatorname{Var}(y|\mathbf x) > E(y|\mathbf x), we have overdispersion (which is quite common), in the reverse case underdispersion (which is not so common)

  • Fortunately, MLE of the Poisson regression model are consistent and asymptotically normal even if the Poisson distribution does not hold. This is then called Quasi-Maximum Likelihood estimation (QML). (This is similar to the case of OLS when the disturbance is not normally distributed)

  • If the distributional assumption does not hold and QML is used, standard errors are wrong. One then has to compute adjusted standard errors

    • The residuals of the Poisson model are as usual \hat u_i = y_i-\hat y_i. To judge whether there is overdispersion the residuals are scaled (divided) by \sqrt{\hat y_i}. This is a proper heteroscedasticity correction. Additionally, \hat y is a predictor for E(y) and therefore also for \operatorname{Var}(y|\mathbf x). Hence, the variance of these scaled residuals should be one if the condition \operatorname{Var}(y|\mathbf x) = E(y|\mathbf x) is true

    • If the calculated standard deviation of these scaled residuals, \hat \sigma, is larger than one, we have overdispersion. In this case the estimated standard errors of the coefficients should be multiplied by \hat \sigma (a small sketch of this correction follows below)
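Continuing the simulated sketch from above (the objects x and y and the true Poisson model), the overdispersion check and the corresponding standard-error correction could look like this:

Code
# Overdispersion check for a Poisson fit (sketch; sigma_hat should be close to 1 here,
# because the simulated data are truly Poisson distributed)
pois_fit  <- glm(y ~ x, family = poisson)
u_scaled  <- (y - fitted(pois_fit)) / sqrt(fitted(pois_fit))   # scaled residuals
sigma_hat <- sd(u_scaled)
sigma_hat * sqrt(diag(vcov(pois_fit)))                         # corrected standard errors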


10.3.1 Example

  • As an example we analyze the crime1 data set of the Wooldridge Package. The dependent variable narr86 is the number of times a man is arrested during 1986. This variable is zero for 1970 of the 2725 men in the sample and only eight values of narr86 are larger than five. So, the Poisson regression model is more appropriate than OLS

  • Some interesting explanatory variables are

    • pcnv: proportion of prior convictions
    • tottime: time in prison since age 18, in months
    • ptime86: months in prison during 1986
    • inc86: legal income 1986, in $100s
    • black: =1 if black
    • hispan: =1 if Hispanic
  • We estimate the model with OLS as well as with the Poisson and the Quasi-Poisson regression model; the latter two only differ by the correction of the standard errors due to overdispersion

  • Additionally, we calculate the APEs, which are comparable to the OLS estimates


  • The OLS estimate for pcnv is -0.132, which means that with \Deltapcnv = 0.1, the expected number of arrests during 1986 decreases by 0.013

  • The corresponding coefficient of the Poisson model is -0.402. But this is now a semi-elasticity: with \Deltapcnv = 0.1, the expected number of arrests during 1986 decreases by 4.02%. Calculating the APE (which averages Equation 10.2) with the package marginaleffects, we get the absolute effect of -0.162, which is now comparable to the OLS estimate

  • The effect of black in the OLS model is 0.327, meaning that for black men the expected number of arrests is 0.327 higher than for white men

  • The Poisson coefficient for black is 0.661. In this case (a large \Delta x_j), we should use the precise formula, e^{\beta_j \Delta x_j}-1, leading to exp(0.661)-1 \approx 0.94, i.e., the expected number of arrests is about 94% higher than for white men. Calculating the APE (which averages Equation 10.2), we get the effect in absolute terms, 0.328, which is about the same as the estimated OLS effect

  • The SE of the transformed residuals of the Poisson model is 1.232, showing a moderate overdispersion. The column QPoisson shows the corrected standard errors for the estimated \beta_j

  • The estimated R^2, which is the squared correlation of actual versus predicted values of narr86, shows that the overall fit of the Poisson regression model is better than that of the linear regression model, although the OLS coefficients are chosen to maximize R^2. But OLS is restricted to a linear relationship


library(wooldridge)
library(marginaleffects)   # for avg_slopes()
library(modelsummary)      # for the results table below
data("crime1")

# OLS
outlm <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
               inc86 + black + hispan + born60 , 
             data = crime1)


# Poisson - family = poisson and the default link is "log", 
# so the inverse default link is "exp"
outpoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
                    inc86 + black + hispan + born60 , 
                  data = crime1, family = poisson)


# Quasi-Poisson - this only adjusts the standard errors 
outquasipoisson <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
                         inc86 + black + hispan + born60 , 
                       data = crime1, family = quasipoisson)


# Calculating APE with "marginaleffects"
marg_lm <- avg_slopes(outlm)
marg_pois <- avg_slopes(outpoisson)
marg_qpois <- avg_slopes(outquasipoisson)

Code
# Reset the custom glance function for glm() (defined earlier) so that no extra
# goodness-of-fit statistics are added by modelsummary
glance_custom.glm <- function(x, ...) {
  
}

modelsummary( list("OLS"=outlm, "Poisson"=outpoisson, "Poisson APE"=marg_pois,
                   "QPoisson"=outquasipoisson, "QPoisson APE"=marg_qpois), 
              output="flextable", 
              gof_omit = "F", 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.2:

Number of arrests, depending on several variables, several count models

                        OLS                 Poisson             Poisson APE         QPoisson            QPoisson APE
(Intercept)              0.577*** (0.038)   -0.600*** (0.067)                       -0.600*** (0.083)
pcnv                    -0.132**  (0.040)   -0.402*** (0.085)   -0.162*** (0.035)   -0.402*** (0.105)   -0.162*** (0.043)
avgsen                  -0.011    (0.012)   -0.024    (0.020)   -0.010    (0.008)   -0.024    (0.025)   -0.010    (0.010)
tottime                  0.012    (0.009)    0.024+   (0.015)    0.010+   (0.006)    0.024    (0.018)    0.010    (0.007)
ptime86                 -0.041*** (0.009)   -0.099*** (0.021)   -0.040*** (0.008)   -0.099*** (0.025)   -0.040*** (0.010)
qemp86                  -0.051*** (0.014)   -0.038    (0.029)   -0.015    (0.012)   -0.038    (0.036)   -0.015    (0.014)
inc86                   -0.001*** (0.000)   -0.008*** (0.001)   -0.003*** (0.000)   -0.008*** (0.001)   -0.003*** (0.001)
black                    0.327*** (0.045)    0.661*** (0.074)    0.328*** (0.045)    0.661*** (0.091)    0.328*** (0.055)
hispan                   0.194*** (0.040)    0.500*** (0.074)    0.235*** (0.040)    0.500*** (0.091)    0.235*** (0.050)
born60                  -0.022    (0.033)   -0.051    (0.064)   -0.020    (0.025)   -0.051    (0.079)   -0.020    (0.031)
Num.Obs.              2725                2725                2725                2725                2725
R2                       0.072
AIC                   6721.4              4517.5              4517.5
BIC                   6786.4              4576.6              4576.6
Log.Lik.             -3349.678           -2248.761           -2248.761
RMSE                     0.83                0.83                0.83                0.83                0.83

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4 Censored and truncated data

  • In many economic contexts, decision problems are such that either a positive amount or a zero amount is chosen, e.g., the demand for cars or alcohol (corner solutions). In this case, we have a form of censored data and the distribution of car expenditures is a combination of a discrete distribution (at 0) and a continuous distribution

  • Another form of censored data arises when the dependent variable is censored from above or below a certain threshold, typically due to survey design; e.g., if survey respondents may answer a question about their wealth only with "more than 1 million". Typically, there is no underlying optimization problem in this case

  • In the first example for censored data we have expenditure data for buyers and non-buyers (which are zero). However, if the data are collected from sales tax records, then the data would only include buyers

  • Data in which observations are unavailable above or below a certain threshold are called truncated data. Note, for truncated data we observe neither y_i nor \mathbf x_i, in contrast to censored data. The sample is not random any more if the selection mechanism is related to the value of the dependent variable

    • Remark: Selection mechanisms which are exogenous, i.e., the dependent variable and its error term are not involved, pose no special estimation problems

      • Example:

        • Suppose we want to estimate the relationship between the saving rate and income, but we have oversampled people with high income (exogenous sampling mechanism, which is based on an exogenous (rhs) variable). If the same relationship between savings and income is true for all people, this would not lead to biased estimates, (although with higher variance – less variation in the explanatory variable)

        • However, if we have oversampled people with a high saving rate (endogenous sampling mechanism) this would lead to biased estimates even if the same relationship between savings and income is true for all people (sample selection problem)

  • A linear regression model may be inadequate for both censored and truncated data

  • Figure 10.4 below shows the effects of censored and truncated data for estimating the true (population) relationship. In both cases we see a significant bias towards zero in the estimates based on the observed data. Note, regarding the truncated sample, we have an endogenous sampling mechanism related to the value of y – thus there is a sample selection problem


Code
library("ggplot2")
library("dplyr")
library("tidyverse")

baw <- FALSE
set.seed(2020)
N <- 1000
X = rnorm(N, 3, 4)
Y = - 1 + 0.5 * X + rnorm(N, 0, 1.8)
z <- data.frame(X = X, Y = Y)

z <- mutate(z,
            neg = factor(Y < 0, labels = c("no", "yes")),  # is Y negative?
            Yc = ifelse(Y > 0 , Y, 0),
            Yt = ifelse(Y > 0, Y, NA))
z2 <- gather(z, key, value, -X, - neg)

colols <- ifelse(baw, "black", "#AA4444")

ys <- c("Y" = "Whole sample", "Yc" = "Censored sample", "Yt" = "Truncated sample")

gp <- ggplot(z2, aes(X, value)) + geom_point(size = 0.6, aes(colour = neg)) + 
    geom_smooth(method = "lm", se = FALSE,  lty = "twodash", color = colols) +
    geom_abline(intercept = -1, slope = 0.5) +
    facet_wrap(~ key, ncol = 3, labeller = labeller(key = ys)) +
    guides(colour = "none")  + xlab("") + ylab("") + 
    theme(legend.text = element_text(size = 6), 
          legend.title= element_text(size = 8),
          axis.title = element_text(size = 8))

if (baw) gp <- gp + scale_color_grey()
gp 

Figure 10.4: Simulated data, censored or truncated, and the corresponding regression lines compared to the true relationship

Some examples for typical count and censored data

library(wooldridge);  
data("crime1")

# Number of arrest - typical count data
hist(crime1$narr86, main = "Typical count data, should look like a Poisson distribution")


library(wooldridge);  
data("mroz")

# Number of hours worked - not everybody works 
# Data apparently not Poisson distributed
hist(mroz$hours, main = "Typical censored data, (>= 0)")


library(wooldridge);  
data("recid") 

# Duration time until next arrest. We have different release times from prison, 
# hence different thresholds as the endpoint for observations is fixed
hist(recid$durat, main = "Censored data from above with different but known thresholds")


10.4.1 Truncated Normal Regression Model

Let’s start with the analysis of truncated regression models. Characteristic for these is that the outcome y_i and the explanatory variables \mathbf x_i are only observed if the outcome y_i is larger (or smaller) than some threshold c_i

  • In this case, the sample is not a random sample from the population (because some units will never be a part of the sample)

  • The truncated normal regression model is:

y_i = \mathbf x_i \boldsymbol \beta + u_i

u_i \, | \, \mathbf x_i \sim N(0, \sigma^2)

(y_i, \mathbf x_i) \ \text{ only observed if } \ y_i > c_i \ \ \ (\text{or } \ y_i < c_i) \tag{10.4}

  • Truncation from below or left: observe y_i and \mathbf x_i if y_i > c_i

  • Truncation from above or right: observe y_i and \mathbf x_i if y_i < c_i

  • In the following we examine truncation from below. Truncation from above is handled analogously and does not require any new results, Green (2017)


Truncated normal distribution

Theorem 10.1 (Truncated normal distribution, see Green (2017)) Let y be a normally distributed random variable with density f(y), mean E(y)=\mu and variance \operatorname {Var}(y)=\sigma^2 and a cutoff constant c. Then the conditional density function f(y|y>c) is

f(y) \ \equiv \ f(y|y>c)\cdot P(y>c) \ \ \Rightarrow

f(y|y>c) = \frac{f(y)}{P(y>c)} = \frac{f(y)}{P(y-\mu>c-\mu)} =

\frac{f(y)}{P[(y-\mu)/\sigma > (c-\mu)/\sigma]} \, = \, \frac{f(y)}{1-F[(c-\mu)/\sigma]} \, =

\frac{\frac{1}{\sigma}\phi \left( \dfrac{y-\mu}{\sigma}\right)}{1-\Phi\left(\dfrac{c-\mu}{\sigma}\right)} \ \tag{10.5}

As already defined above, \phi(z) denotes the probability density and \Phi(z) the cumulative distribution function (cdf) of the standard normal distribution (see Equation A.3 and Equation A.4).

The mean and variance are given by:

\begin{align*}E(y|y>c) &= \mu + \sigma \, \lambda \left(\frac{\mu-c}{\sigma} \right) \\ \operatorname{Var}(y|y>c) &= \sigma^2 \left[ 1-\delta \left(\frac{\mu-c}{\sigma}\right) \right] \end{align*} \tag{10.6}

where \lambda(x) is the Inverse Mills Ratio (or hazard function of the standard normal distribution): 1

\lambda(x) := \dfrac {\phi(x)}{\Phi(x)}, \ \ \lambda(x) >0, \ \lambda'(x) <0, \ \forall x \tag{10.7}

\delta(x) := \lambda(x)[\lambda(x)-x], \ \ 0<\delta(x)<1

Remarks: E(y|y>c) > E(y) and \operatorname {Var}(y|y>c) < \operatorname {Var}(y).

The Inverse Mills Ratio \lambda(x) (IMR) is fundamental in models with incomplete data or self-selective samples. It can be interpreted as a measure of the probability that a particular observation is selected (is in the sample) compared to those that are not selected (not in the sample). A lower IMR indicates a higher probability of selection.
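As a quick numerical check of Theorem 10.1, the following sketch (with assumed values \mu=0, \sigma=1 and c=0.5) compares the simulated mean of a truncated normal sample with the formula in Equation 10.6:

Code
# Inverse Mills ratio and a simulation check of E(y | y > c)
lambda <- function(x) dnorm(x) / pnorm(x)

mu <- 0; sigma <- 1; c0 <- 0.5          # assumed values for illustration
set.seed(1)
y <- rnorm(1e6, mu, sigma)

mean(y[y > c0])                         # simulated truncated mean
mu + sigma * lambda((mu - c0) / sigma)  # Equation 10.6 - should be nearly identical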


Code
x <- seq(-3, 3, length=100)

# plotting normal distribution
plot(x, dnorm(x), type="l", ylim=c(0, 1.125)) 

# plotting truncated normal distribution, truncated from below at -0.5
a=-0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0,  x1=a, y1=o, lty = 3) 

# plotting truncated normal distribution, truncated from below at 0.5
a=0.5
xt <- seq(a, 3, length=100)
lines(xt, dnorm(xt)/(1-pnorm(a)) )
o <- dnorm(a)/(1-pnorm(a))
segments( a,0,  x1=a, y1=o, lty = 3) 

Figure 10.5: Normal and truncated normal distributions for c=-0.5 and c=0.5. Note that the area below each density always has to be one

Estimation

Applying OLS would not yield correct results because MLR.2 is violated (see Section 2.7). To see this note that, according to the theorem above, the expectation of y_i for the observed subpopulation y_i > c_i is:

E(y_i|y>c_i) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \tag{10.8}

  • Hence, regressing y_i only on \mathbf x_i leads to an omitted variable bias as the omitted variable \lambda(\mathbf x_i) (the inverse Mills Ratio, Equation 10.7) is a nonlinear function of \mathbf x_i and therefore probably correlated with \mathbf x_i

    • In particular, for observations with a low a priori probability to be in the sample (because \mathbf x_i \boldsymbol \beta is low), the variable \lambda(\mathbf x_i) is high, thereby correcting for the selection bias in Equation 10.8

The partial effect of x_j on E(y_i|y>c_i), which we want to estimate, is:

\dfrac {\partial E(y_i|y>c_i)}{\partial x_j} \, = \, \beta_j + \sigma \, \lambda' \, \dfrac {\partial \left( \frac{\mathbf x_i \boldsymbol \beta - c_i }{\sigma} \right) }{\partial x_j} \, = \, \beta_j(1 + \underset {(-)} {\lambda'}) \tag{10.9}

  • This can be shown to be less than \beta_j. Hence, we have a bias towards zero (compare Figure 10.4) 2

As just shown, OLS is biased and inconsistent for truncated data because of an omitted variable problem.

Thus, Maximum Likelihood estimation seems appropriate.

  • According to Theorem 10.1, the density of an observed outcome y_i, conditional on the explanatory variables and on y_i > c_{i}, is:

f\left(y_{i} \, | \, \mathbf{x}_{i}, y_i > c_{i}\right) \ = \ \frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \tag{10.10}

Maximizing the log-Likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{i=1}^n \left\{ \ln \left[ \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \right] - \ln \left[ 1 - {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \right\}

\rightarrow \ \ \ \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

  • Note that non-normality or heteroscedasticity in u_i leads to inconsistent estimates

  • Further note, the untruncated data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)


Example

Code
# We simulate data and store them in the data.frame d
set.seed(1234890)
n <- 10000

# true parameters, which we want to estimate 
sigma <- 4
beta_0 <- 2
beta_1 <- 1;

x <- rnorm(n, mean = 0, sd = 2); 
eps <- rnorm(n, sd = sigma)

# the true model 
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package

# generating truncated data y = yt, at c <= 1
d$yt <- ifelse(d$y > 1, d$y, NA)

# compare estimates for full / truncated data
# we use the function `truncreg` from the package `truncreg`
full  <- lm(y ~ x, data = d)
ols   <- lm(yt ~ x, data = d)
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")

summary(trunc)
      
      Call:
      truncreg(formula = yt ~ x, data = d, point = 1, direction = "left")
      
      BFGS maximization method
      45 iterations, 0h:0m:0s 
      g'(-H)^-1g = 7.38E-09 
       
      
      
      Coefficients :
                  Estimate Std. Error t-value  Pr(>|t|)    
      (Intercept) 2.119760   0.170904  12.403 < 2.2e-16 ***
      x           0.994784   0.046537  21.376 < 2.2e-16 ***
      sigma       3.972827   0.079421  50.022 < 2.2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Log-Likelihood: -13547 on 3 Df

Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc), 
              output="flextable", 
              gof_map = c("nobs", "aic", "bic", "rmse"), 
              align = "lddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.3:

Simulated truncated data, estimated with OLS and MLE. True values are b0=2, b1=1 and sigma=4

                  OLS full data        OLS trunc data       MLE trunc data
(Intercept)           2.081*** (0.040)     4.707*** (0.037)     2.120*** (0.171)
x                     1.035*** (0.020)     0.474*** (0.019)     0.995*** (0.047)
sigma                                                           3.973*** (0.079)
Num.Obs.          10000                 5967                 5967
AIC               56035.8              29009.9              27099.7
BIC               56057.4              29030.0              27119.8
RMSE                  3.99                 2.75                 3.69

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4.2 Censored Normal Regression Model

  • Generally, censored data do not result from a truncation problem. In contrast to truncated data, the explanatory variables \mathbf x_i are observed even if the outcome y_i is below (or above) some threshold c_i

  • Typically, a dependent variable is censored from above or below a certain threshold due to survey design; e.g., values are only reported up to a certain level (e.g., top coded wealth)

    • Another example for censoring from below is limited measurement accuracy, for instance tumor marker values or antibody counts in medical data
  • With y_i^* being the true outcome and y_i being the observed outcome, the censored normal regression model is

y_i^* = \mathbf x_i \boldsymbol \beta + u_i

u_i \, | \, \mathbf x_i \sim N(0, \, \sigma^2)

y_i = \begin{cases} \max(y_i^*,c_i) \ \ \ \text{:left censoring (from below)} \\ \min(y_i^*,c_i) \ \ \ \ \text{:right censoring (from above)} \\ \max(y_i^*,\underline {c}_i) \ \land \ \min(y_i^*,\bar c_i) \ \ \ \text{:censoring from above and below} \end{cases} \tag{10.11}

  • In the following we examine censoring from below. Censoring from above is handled analogously

Distribution of censored data

Figure 10.6: Censored normal distribution at c=0, a mixture of a discrete and a continuous distribution. In this example, half of the probability mass is concentrated in one point

For censored data, the distribution function of y_i is a mixture of a discrete distribution, concentrated at the threshold c_i, and a truncated continuous density; the normal density in our case

  • Such a distribution is shown in Figure 10.6 above with threshold c_i=0

    • Note, in this case no renormalization of the truncated continuous density is necessary, because the area below the truncated density function plus the probability of the discrete mass point is one
  • We have two parts of the distribution of y_i (all conditional on \mathbf x_i, which we leave out to save notation):

    1. For (left) censored observations, y_i = c_i, we have a discrete mass point with probability:

    P(y_i = c_i) \,=\, P(y_i^* \leq c_i) \,=\, P(u_i \leq c_i - \mathbf x_i \boldsymbol \beta) \,=\, P \left( \frac {u_i}{\sigma} \leq \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \,=

    \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.12}

    2. For the uncensored observations, y_i = y_i^* > c_i, the density of y_i, conditional on y_i>c_i, corresponds to a truncated density f\left(y_{i} \, | \, y_i > c_{i}\right), already derived in the section on truncated data, see Equation 10.5 and Equation 10.10
  • For the whole two-part distribution of y_i we have

f(y_i \,|\, \mathbf x_i, c_i) \ = \begin{cases} P(y_i=c_i) = \Phi \left( \dfrac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right); \ \ \text{ if } y_i=c_i \\ f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i); \ \ \text{ if } y_i>c_i \end{cases} \tag{10.13}

  • The second part simplifies to

f(y_{i} \, | \, y_i > c_{i}) \cdot P(y_i>c_i) \ = \ \underbrace {\frac{\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} {1 - \Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)}}_{ f(y_{i} \, | \, y_i > c_{i})} \underbrace {\left [1 - \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right ]}_{P(y_i>c_i)} \ =

\underbrace {\frac {1}{\sigma }\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)}_{f(y_i^*)=N(\mathbf x_i \boldsymbol \beta, \, \sigma^2)} \tag{10.14}

  • And this is simply the non-truncated normal distribution of the uncensored observations (as mentioned above, there is no need for a renormalization)

Estimation

This leads to the joint density of the observed vector \mathbf y:

f(\mathbf y \, | \, \mathbf X,c_i) \ = \ \prod_{y_i=c_i} \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \cdot \prod_{y_i>c_i} \frac{1}{\sigma} \phi \left( \frac {y_i-\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.15}

  • Maximizing the log-likelihood function yields consistent estimates of \hat \beta_0, \hat \beta_1, \ldots , \hat \beta_k, \hat \sigma^2

\underset{\boldsymbol \beta, \sigma}{\max} \ \ln L(\boldsymbol \beta, \sigma^2) \ = \ \sum_{y_i=c_i} \ln \left[ {\Phi \left( \frac {c_i - \mathbf{x}_{i} \boldsymbol{\beta}} {\sigma }\right)} \right] \ + \ \sum_{y_i>c_i} \ln \left[ \frac {1}{\sigma }{\phi \left( \frac {y_{i} - \mathbf x_i \boldsymbol \beta} {\sigma}\right)} \right] \ \ \rightarrow \ \ \hat {\boldsymbol \beta}, \hat \sigma^2

  • Estimation has to be done by numerical optimization methods

  • After estimation, parameter tests and tests of exclusion restrictions can be carried out as usual with Wald, Likelihood Ratio or Lagrange Multiplier tests

  • Note, if the distributional assumptions on u_i (normality and homoscedasticity) are violated, the MLE above is inconsistent

  • Further note, the uncensored data should be approximately continuous to be appropriately modeled with this approach (not count data with only a few counts!)


OLS applied to the observed data doesn’t yield consistent estimates for \boldsymbol \beta as the expectation of y_i, conditional on \mathbf x_i and y_i\geq c_i, is not \mathbf x_i \boldsymbol \beta but a highly nonlinear function in \mathbf x_i and \sigma

E(y_i) \ = \ E(y_i|y \geq c_i) \ = \ E(y_i|y_i > c_i) \cdot P(y_i > c_i) \, + \, \underbrace {E(y_i|y_i = c_i)}_{c_i} \cdot P(y_i = c_i)

Plugging in our known results regarding the truncation case E(y_i|y_i > c_i) (Equation 10.8) and for P(y_i = c_i) (Equation 10.12), respectively P(y_i > c_i), we have

E(y_i) \ = \ \underbrace {\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right]}_{E(y_i|y_i > c_i)} \underbrace {\left[ 1 -\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \right]}_{P(y_i>c_i) \, = \, [1-P(y_i=c_i)]} \, + \, c_i \, \underbrace {\Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right)}_{P(y_i=c_i)} \ =

\left[ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma} )} \right] \underbrace {\left[ \Phi \left( \frac {\mathbf x_i \boldsymbol \beta - c_i}{\sigma} \right) \right]}_{P(y_i>c_i)} \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \ =

\mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta - c_i} {\sigma} \right) \, + \, c_i \, \Phi \left( \frac {c_i - \mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.16}
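A quick simulation check of Equation 10.16 (a sketch with assumed values for the linear index, \sigma and the threshold):

Code
# Check E(y) for left-censored data against Equation 10.16
set.seed(1)
xb <- 0.5; sig <- 2; c0 <- 0                 # assumed x_i*beta, sigma and threshold
ystar <- xb + rnorm(1e6, sd = sig)           # latent outcome
y     <- pmax(ystar, c0)                     # censoring from below at c0

mean(y)                                                                        # simulated E(y)
xb*pnorm((xb - c0)/sig) + sig*dnorm((xb - c0)/sig) + c0*pnorm((c0 - xb)/sig)   # Equation 10.16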


Example continued

Code
# simulated data - example from above continued
set.seed(1234890)
n <- 1500  # fewer observations to better judge efficiency 

# true parameters, which we want to estimate 
sigma <- 4
beta_0 <- 2
beta_1 <- 1;

x <- rnorm(n, mean = 0, sd = 2); 
eps <- rnorm(n, sd = sigma)

# the true model 
y <- beta_0 + beta_1 * x + eps

d <- data.frame(y = y, x = x)
library(truncreg) # Truncated Regression package
library(censReg) # Censored Regression package

## truncated y = yt, censored y = yc, at c = 1
d$yt <- ifelse(d$y > 1, d$y, NA)
d$yc <- ifelse(d$y > 1, d$y, 1)

## compare estimates for full/truncated/censored data
# ols full data 
full <- lm(y ~ x, data = d)

# ols truncated data
ols <- lm(yt ~ x, data = d) 

# ols censored data
olsc <- lm(yc ~ x, data = d)

# MLE truncated data
trunc <- truncreg(yt ~ x, data = d, point = 1, direction = "left")

# MLE censored data
censored <- censReg(yc ~ x, data = d, left = 1) 

summary(censored)
      
      Call:
      censReg(formula = yc ~ x, left = 1, data = d)
      
      Observations:
               Total  Left-censored     Uncensored Right-censored 
                1500            594            906              0 
      
      Coefficients:
                  Estimate Std. error t value Pr(> t)    
      (Intercept)  2.08976    0.11893   17.57  <2e-16 ***
      x            0.90929    0.05969   15.23  <2e-16 ***
      logSigma     1.38363    0.02540   54.47  <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Newton-Raphson maximisation, 5 iterations
      Return code 1: gradient close to zero (gradtol)
      Log-likelihood: -2975.228 on 3 Df

Code
modelsummary( list("OLS full data"=full, "OLS trunc data"=ols, "MLE trunc data"=trunc,
                   "OLS cens data"=olsc, "MLE cens data"=censored), 
              output="flextable", 
              gof_map = c("nobs", "aic", "bic", "rmse"), 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3)
Table 10.4:

Simulated truncated and censored data, estimated with OLS and MLE. True values are b0=2, b1=1 and sigma=4. Note, we now have only 1500 observations, compared to 10000 in the previous example. This shows that censored regression is more efficient than truncated regression

                  OLS full data       OLS trunc data      MLE trunc data      OLS cens data       MLE cens data
(Intercept)           2.039*** (0.104)    4.716*** (0.096)    2.305*** (0.419)    3.352*** (0.071)    2.090*** (0.119)
x                     0.964*** (0.053)    0.397*** (0.050)    0.827*** (0.116)    0.526*** (0.036)    0.909*** (0.060)
sigma                                                         3.920*** (0.201)
logSigma                                                                                              1.384*** (0.025)
Num.Obs.          1500                 906                 906                1500                1500
AIC               8439.7              4391.6              4117.3             7274.4              5956.5
BIC               8455.6              4406.0              4131.8             7290.3              5972.4
RMSE                  4.02                2.72                3.55                2.73

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


10.4.3 Tobit model

  • Strictly speaking, the Tobit model, Tobin (1958), is formally identical to a censored normal regression model. However, there are two distinct features, one minor and one important

    • In most cases, the Tobit model is formulated with left-censoring and c_i=0. This is not important and can easily be changed

    • More importantly, the Tobit model is based on an optimization model with corner solutions and not on censoring of data above or below a certain threshold due to certain survey designs

    • This different starting point has serious consequences for interpreting the estimation results, especially for calculating the relevant partial effects

  • Let us consider an example for an underlying optimization problem with corner solutions for some agents i who maximize their utility U(y_i, z_i) subject to a budget constraint p_yy_i+p_zz_i \leq I_i

  • y_i and z_i could be leisure or working time versus consumption of goods, or consumption of tobacco versus other goods, or consumption today versus consumption tomorrow, etc. The critical point is the additional non-negativity constraints; it is not possible to choose negative values, hence, y_i≥0 and z_i≥0


Without the non-negativity constraints the solution of this optimization problem for y_i is (assuming a linear solution which is to be expected with certain utility functions)

y_i^* = f(p_y, p_z, I_i, u_i) = \mathbf x_i \boldsymbol \beta + u_i

  • Here, y_i^* represents the potential choice, without non-negativity constraints, and u_i represents unobserved heterogeneity of the individual utility function. However, invoking the non-negativity constraints, we arrive at a standard censored regression model, with y_i being the observed values

y_i^* = \mathbf x_i \boldsymbol \beta + u_i \\ y_i = y_i^* \quad \text{if } \ y_i^* > 0 \\ y_i = 0 \quad \ \text{if } \ y_i^* ≤ 0

  • The difference to the standard censored regression model is that in this case, the non-negativity arises because the agent simply cannot choose negative values of a good, and not because the data are censored for some reason

This has consequences for the interpretation of the model. In the standard censored regression model we are interested in the population parameter \boldsymbol \beta, which, in the case of censored or truncated data, cannot be consistently estimated by OLS. Therefore, we need a maximum likelihood approach

In the Tobit model, with an underlying optimization problem with non-negativity constraints, we are typically interested in how much the value of the actual observed variable y (labor supply in hours or expenditures on cars or tobacco) changes due to changes of the explanatory variables.

Partial effects

In particular, there are actually three different partial effects present in the model

  1. The effect on the expectation of the latent variable – therefore on potential choices (without non-negativity restriction) and not on actual choices. Typically, this is not what we want in this case

\dfrac {\partial E(y^*|\mathbf x)}{\partial x_j} = \beta_j

  2. The effect on expected actual choices y, but only for uncensored observations, e.g., the choice of already employed people about their hours worked if the wage changes, or the choice about expenditures on tobacco for people who actually are smokers. Sometimes, this is what we want

\dfrac {\partial E(y|\mathbf x, y>0)}{\partial x_j} = \ ?

  3. This is the effect on expected actual choices y, but for all actual values, independently of whether they are censored or not. Most of the time, this is what we want

\dfrac {\partial E(y|\mathbf x, y≥0)}{\partial x_j} = \dfrac {\partial E(y|\mathbf x)}{\partial x_j}= \ ?


Hence, typically, it is the last partial effect we are interested in. For instance, if the wage or the number of children is changing, we want to know the effect on the supplied hours worked coming from people already employed as well as from people who are not currently working.

  • Therefore, the last partial effect also incorporates the effect on the probability for doing something or not (e.g., working, smoking, driving with a car, etc.)

\dfrac {\partial P(y_i>0|\mathbf x_i)}{\partial x_j} \, = \, \dfrac {\partial \Phi(\mathbf x_i \boldsymbol \beta/\sigma)}{\partial x_j} \, = \, \dfrac {\beta_j}{\sigma} \phi(\mathbf x_i \boldsymbol \beta/\sigma) \tag{10.17}

We already calculated the expectation E(y_i| \mathbf x_i, y_i≥0), (Equation 10.16). But here, we are dealing with the case of c_i=0, so the third term in Equation 10.16 is zero

E(y_i| \mathbf x_i, y_i≥0) \ = \ E(y_i| \mathbf x_i) \ = \ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \ =

\Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \left[ \mathbf x_i \boldsymbol \beta \, + \, \sigma \, \lambda \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right]

  • The surprisingly simple partial effect is

\dfrac {\partial E(y_i|\mathbf x_i, y_i≥0)}{\partial x_j} \ = \ \beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \tag{10.18}

  • Hence, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor \Phi(\mathbf x_i \boldsymbol \beta / \sigma), which is clearly positive and less than 1

    • For calculating the APE, we use the average over all observations of this adjustment factor, i.e.,

    \widehat{A P E}_{j} = \hat\beta_{j} \, \frac{1}{n} \sum_{i=1}^{n} \Phi \left(\frac {\mathbf{x}_{i} \hat{\boldsymbol{\beta}}}{\sigma} \right) \tag{10.19}

    • For the PEA we use \hat\beta_{j} \, \Phi(\bar{\mathbf x} \hat{\boldsymbol \beta} / \hat\sigma)

Proof for the partial effect:
  • Note, that d \Phi(z) / d z = \phi(z) and d \phi(z) / d z = -z \phi(z)

\dfrac {\partial }{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta \, \Phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \, + \, \sigma \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \right ] \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} - \sigma\left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right) \frac {\beta_j}{\sigma} \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) + \mathbf x_i \boldsymbol \beta \, \phi \left( \frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} - \left (\frac{\mathbf x_i \boldsymbol \beta} {1} \right) \, \phi \left (\frac{\mathbf x_i \boldsymbol \beta} {\sigma} \right)\frac {\beta_j}{\sigma} \ =

\beta_j \Phi \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)
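A minimal sketch of Equation 10.19 in R (assuming a Tobit model `outtobit` fitted with tobit() from the AER package, left-censored at 0, as in the example in Section 10.4.4 below):

Code
library(AER); library(wooldridge); data("mroz")

outtobit <- tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                  data = mroz)   # left = 0 is the default

fac <- mean( pnorm(outtobit$linear.predictors / outtobit$scale) )  # average adjustment factor
fac * coef(outtobit)                                               # APEs, Equation 10.19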


Sometimes, we only want the effect on the choices of already employed people about their hours worked in response to changing wages, or the choice about expenditures on tobacco for people who actually are smokers (the second partial effect mentioned above). In this case we need

E(y_i|\mathbf x_i, y_i>0) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \ = \ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )}

which we already encountered in Equation 10.8 when dealing with the truncated normal distribution. Remember, the function \lambda(z) is the inverse Mills ratio (Equation 10.7) and equals \phi(z)/\Phi(z)

  • Then, the partial effect is

\dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ = \beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right] \tag{10.20}

  • Once again, the corresponding coefficient \beta_j has to be multiplied by a unit specific adjustment factor [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ], which can be shown to be strictly between zero and one. For calculating the APE, we use the average over all observations of this adjustment factor. For the PEA we simply calculate the expression at the mean \bar{\mathbf x}

Proof for the partial effect:

Once again note, that d \Phi(z) / dz = \phi(z), \; d \phi(z) / d z = -z \phi(z) \, and \, \lambda(z)=\phi(z)/\Phi(z)

\text{ } \\ \dfrac {\partial}{\partial x_j} \left [ \mathbf x_i \boldsymbol \beta + \sigma \, \dfrac {\phi(\frac{\mathbf x_i \boldsymbol \beta} {\sigma})}{\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) }\right] \ =

\beta_j + \dfrac{- \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \, \Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) - \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma}\phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } =

\beta_j + \left[ - \dfrac{ \sigma ( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \, \frac {\beta_j}{\sigma}\Phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) } {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } - \dfrac{ \sigma \phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} ) \frac {\beta_j}{\sigma} \phi (\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )} {\Phi(\frac{\mathbf x_i \boldsymbol \beta}{\sigma} )^2 } \right] \ =

\beta_j \left[1 - \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right) - \lambda \left( \frac{\mathbf x_i \boldsymbol \beta}{\sigma} \right)^2 \right]
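Analogously, a sketch of the adjustment factor in Equation 10.20 (the effect conditional on y>0), reusing the `outtobit` object from the sketch above:

Code
z    <- outtobit$linear.predictors / outtobit$scale
imr  <- dnorm(z) / pnorm(z)              # inverse Mills ratio lambda(z)
fac0 <- mean(1 - z * imr - imr^2)        # average adjustment factor, strictly between 0 and 1
fac0 * coef(outtobit)                    # APEs conditional on y > 0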


10.4.4 Example

Now, we are estimating a concrete example using the MROZ data, which we have already studied in the context of Logit and Probit models. We are interested in the number of working hours supplied by married women, hours, and the effects of several explanatory variables

  • These variables are nwifeinc = non-wife income, educ = education in years, exper = experience in years, age, kidslt6 = number of kids younger than 6 and kidsge6 = number of kids aged 6 or older

  • We estimate the model with OLS, Tobit and Probit. The reason for the Probit estimation is a special feature of the Tobit model:

    • The Tobit estimates implicitly incorporate the choice of doing something (for instance, the probability of working, depending on some variables) as well as the response of the intensity of doing something (e.g., the response of working time to education)

    • These two effects are: {\partial P(y_i>0|\mathbf x_i)}/{\partial x_j} = (\beta_j/\sigma) \phi(\mathbf x_i \boldsymbol \beta/\sigma) (Equation 10.12, Equation 10.17) and {\partial E(y_i|\mathbf x_i, y_i>0)}/{\partial x_j} = \beta_j [ 1 - ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} ) - \lambda ( {\mathbf x_i \boldsymbol \beta}/{\sigma} )^2 ] (Equation 10.20) and hence, are represented by the same parameter \beta_j times a positive constant

    • Therefore, every variable in the model affects the participation probability and the intensity equation with the same sign. Further, the Probit parameters should be approximately equal to \hat \beta_j/\hat \sigma in terms of the Tobit estimates. So, it is always a good idea to additionally estimate a corresponding Probit model (coding y_i > 0 as 1) and check whether the estimated coefficients actually have the same signs. If there are some coefficients with statistically significantly different signs, then the Tobit model might not be appropriate


library(boot)
library(AER)          # for tobit()
library(wooldridge) 
data("mroz")

# OLS
outols <- glm(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz, x=TRUE) 

# Tobit model - default of `tobit` from `AER` package: left = 0. 
# You could also use `censReg` -> same result
outtobit <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)


# with censReg - same result but logSigma is reported (but avg_slopes does not work)
#outtobit1 <-  censReg(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz)


# you can calculate the PEA with
# margEff(outtobit1)
Code
########## for APEs for modelsummary() 

## bootstrap only for std.errors necessary 

bootbdy <- function(data, indices) {
   dat <- data[indices,]  # resampled data
   out <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                         data = dat) 
   m_tobit <- avg_slopes(out)
   fac <- mean( pnorm( predict(out)/out$scale ), na.rm = TRUE ) # scaling factor, Equation 10.19 
   m_tobit["estimate"] = m_tobit["estimate"]*fac
   return( coef(m_tobit) )
}

bootresult_ <- boot(data=mroz, statistic=bootbdy, R=500)

se_boot <- matrixStats::colSds(bootresult_$t) # for debugging, should be the same as mm

mm = summary(bootresult_)  # to calculate std.errors

## to use `modelsummary`, making a data.frame (with naming according broom package)
m_tobit1 = data.frame(term=names(bootresult_$t0), estimate=mm$original, std.error=mm$bootSE, statistic=mm$original/mm$bootSE, p.value= (1 - pnorm( abs(mm$original/mm$bootSE) ))*2  )

## modelsummary needs a special format of a list with tidy and glance
m_tobit <- list(  tidy = m_tobit1,  glance = data.frame(NA)  )
class(m_tobit) <- "modelsummary_list"

############## 



# Probit model - to check for misspecification
outprobit <- glm( inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                  data = mroz, family = binomial(link = "probit"))


# Average partial effects for Probit with 'avg_slopes()'
marg_probit <- avg_slopes(outprobit)

# Average partial effects for Tobit with 'avg_slopes()', but without correction factor fac
marg_tobit <- avg_slopes(outtobit)

Code
modelsummary( list("OLS"=outols, "Tobit"=outtobit, "Tobit APE"=m_tobit, "Probit"=outprobit, "Probit APE"=marg_probit), 
              output="gt", 
              #statistic = c('std.error', 'statistic', 'p.value'), 
              #coef_omit = "Inter",
              gof_omit = "F|N", 
              align = "lddddd", 
              stars = TRUE, 
              fmt = 3
              ) 
OLS Tobit Tobit APE Probit Probit APE
(Intercept)  1330.482***   965.305*      0.270   
 (270.785)    (446.436)     (0.508)  
nwifeinc    -3.447       -8.814*      -5.189*     -0.012*     -0.004*  
   (2.544)      (4.459)      (2.593)     (0.005)     (0.001)  
educ    28.761*      80.646***    47.473***    0.131***    0.039***
  (12.955)     (21.583)     (12.649)     (0.025)     (0.007)  
exper    65.673***   131.564***    54.115***    0.123***    0.026***
   (9.963)     (17.279)      (4.945)     (0.019)     (0.002)  
I(exper^2)    -0.700*      -1.864***   -0.002** 
   (0.325)      (0.538)     (0.001)  
age   -30.512***   -54.405***   -32.026***   -0.053***   -0.016***
   (4.364)      (7.419)      (4.259)     (0.008)     (0.002)  
kidslt6  -442.090***  -894.022***  -526.278***   -0.868***   -0.261***
  (58.847)    (111.878)     (73.498)     (0.118)     (0.032)  
kidsge6   -32.779      -16.218       -9.547       0.036       0.011   
  (23.176)     (38.641)     (23.421)     (0.044)     (0.013)  
R2     0.266   
AIC 12117.1       7656.2       818.6       818.6     
BIC 12158.7       7697.8       855.6       855.6     
Log.Lik. -6049.534    -401.302    -401.302   
RMSE   746.18       953.68        0.42        0.42    
Left_censored   325      
Un_censored   428      
Right_censored     0      
Total   753      
Scale  1122.02*** 
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Figure 10.7: Number of working hours supplied by married women, hours, and the effects of several explanatory variables, estimated with OLS, Tobit, Tobit-APEs (Equation 10.19) and a Probit model for checking signs of parameters

Code
#### Code for `texreg` procedure - not reported 
library(texreg)

# Calculating APEs for tobit model 
# Basically with: 
#   fav <- mean( pnorm( predict(outtobit)/outtobit$scale ), na.rm = TRUE  )
#   APE <- fav * coef(outtobit)

source('~/Dropbox/R/User Functions/extract.margins.R')   # user function for texreg
source('~/Dropbox/R/User Functions/extract.tobitMargins.R')   # user function for texreg

# APE for Tobit with bootstrapping for standard errors. User fn tobitAPE is integrated in extract.tobitMargins.R 
bootbody <- function(data, indices) {
   out <-  tobit(hours ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6, 
                 data = mroz[indices,])
   return( tobitAPE(out) )
 }
bootresult <- boot(data=mroz, statistic=bootbody, R=1000)


# preparing for texreg to know tobit_output (append classes: "tobitMarg","APE")
tobit_out <- outtobit
class(tobit_out) <- c("tobitMarg","APE", "tobit", "survreg", "margins")


screenreg( list(outols, outtobit, tobit_out, outprobit, marg_probit ),
           custom.model.names = c("OLS", "Tobit", "Tobit APE", "Probit", "Probit APE"), 
            digits = 3, 
           bootresults=bootresult, discrete=c(9)
           )

  • Because of the different scaling factors involved, Tobit coefficients are not directly comparable to OLS coefficients. For instance, look at the OLS estimate for \beta_{educ} which is 28.8 and the Tobit estimate of 80.6. But this is the estimated effect on the latent variable, which is not really relevant in this context

  • To compare Tobit and OLS, one has to compare average partial effects, APEs, Equation 10.18. It turns out that the partial effects of Tobit and OLS differ in a number of cases, though not in all. For instance, the effect of education on E(y|\mathbf x), the expectation of observed hours worked, is estimated to be 47.5

  • Another difference between Tobit and OLS is that, due to the linearity of the model, OLS assumes constant partial effects, whereas partial effects are nonconstant in Tobit. To demonstrate this, look at Figure 10.8 below. There, the different predictors for hours worked are displayed, along with the observed points. The most relevant one is the blue line, which represents E(y|\mathbf x). Note how the blue line converges to the red one, the linear predictor \mathbf x \hat {\boldsymbol \beta} of the Tobit model, as the censoring becomes less important. Further note, the OLS predictor approximates E(y|\mathbf x) quite well over some range of \mathbf x

  • Let us compare the estimates of the Tobit model with the Probit ones. It turns out that every coefficient has the same sign. Further, dividing the Tobit estimate for \beta_{educ} by the Tobit estimate for \sigma (which is 1122) we get 0.072, compared to 0.13 for Probit. Doing the same for \beta_{kidslt6} we get -0.8 compared to -0.87. This is not a huge difference but shows that the initial decision to work is more strongly affected by education or small kids than the decision about the hours worked once in the labor force (see the check below)
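This sign and scale check can be computed directly from the fitted objects above (a small sketch):

Code
# Tobit coefficients scaled by the estimated sigma versus the Probit coefficients
round( cbind( Tobit_scaled = coef(outtobit) / outtobit$scale,
              Probit       = coef(outprobit) ), 3)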


Code
 tobitPredict <- function(x, model) {   # E(y | x) for the tobit model, Equation 10.16 with c = 0
   pred <- pnorm( x/model$scale ) * x + model$scale * dnorm( x/model$scale )
  return(pred) }

 tobitPredict0 <- function(x, model) {   # E(y | x, y > 0), Equation 10.8 with c = 0 (uses the IMR)
   pred0 <- x + model$scale * dnorm( x/model$scale ) /  pnorm( x/model$scale )
  return(pred0) }

olsout2 <- lm(mroz$hours ~ outtobit$linear.predictors)

plot( outtobit$linear.predictors, mroz$hours, ylim=c(-1000, 5000), pch = 16, cex=0.9,
      col=c("#666666", "#666666", "#666666"), xlab="Linear tobit predictor X*beta",
      ylab="Hours and predictors of hours", 
      main = "Different predictors for hours worked" )
abline(c(0,0),c(1,1), col="red", lwd=2.5)
curve(tobitPredict(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE, 
      type = "l", col="blue", lwd = 4.5)
abline(olsout2, col="purple", lwd =2)
abline(0,0) 
curve(tobitPredict0(x, outtobit), from = -3000, to = 2200, n = 150, add = TRUE, 
      type = "l", col="cornflowerblue", lwd = 3)
text(-2500, 640, "E(y|y>0)")
text(-600, -950, "E(y*)")
text(-1100, 280, "E(y|y≥0) = E(y)")
text(-1950, -690, "OLS")

Figure 10.8: E(y|y≥0) (blue), linear tobit predictor X*beta (red), OLS predictor (purple) and E(y|y>0) (light blue)

10.5 Incidental truncation

Until now, we dealt with the problem of truncated and censored data, which is directly linked to the dependent variable y and leads to a sample that is not random any more; this can render OLS estimates inconsistent.

  • Remember, sample selection poses no problem if it is solely linked to exogenous variables or is exogenous to the errors 3

    • To formally show this result, define the indicator s_i, which is one if the observation i is in the sample and otherwise is zero. Then we have for s_i=1 (for fully observed observations, including \mathbf x_i)

s_i y_i \, = \, s_i \mathbf x_i \boldsymbol \beta + s_i u_i \ \ \Rightarrow

E(s_i y_i|\mathbf x_i, s_i=1) \, = \, E(s_i \mathbf x_i \boldsymbol \beta| \mathbf x_i, s_i=1) + E(s_i u_i|\mathbf x_i, s_i=1) \ \ \Rightarrow

E(y_i|\mathbf x_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \underbrace {E(u_i|\mathbf x_i, s_i=1)}_{\text {should be } 0} \, = \, \mathbf x_i \boldsymbol \beta

  • So, if the expectation of u_i is not influenced by s, the conditional expectation of y is equal to \mathbf x_i \boldsymbol \beta and thus, \boldsymbol \beta can be consistently estimated by OLS with the available sample

    • The expectation of u_i is not influenced by s if

      1. s is only a function of the exogenous \mathbf x (exogenous sample selection), in which case the condition E(u_i|\mathbf x_i, s_i=1)=0 reduces to the usual condition E(u_i|\mathbf x_i)=0

      2. s is purely random (random sample selection), in which case the condition E(u_i|\mathbf x_i, s)=0 is met

      3. s is a mixture of both cases above
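The three cases can be illustrated with a small simulation sketch (assumed DGP, y = 1 + 0.5x + u):

Code
set.seed(123)
n <- 10000
x <- rnorm(n)
y <- 1 + 0.5*x + rnorm(n)

s_exog <- x > 0                    # selection based on an exogenous regressor (case 1)
s_rand <- runif(n) < 0.5           # purely random selection (case 2)
s_endo <- y > 1                    # selection based on the outcome itself

coef(lm(y ~ x, subset = s_exog))   # close to (1, 0.5)
coef(lm(y ~ x, subset = s_rand))   # close to (1, 0.5)
coef(lm(y ~ x, subset = s_endo))   # biased (slope attenuated towards zero)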


Incidental truncation or sample selection is a mechanism where sample selection is only indirectly linked to the dependent variable y or its error u; that means via other variables.

  • Very often but not always, this is due to self-selection

  • The leading example is the following: we want to estimate the market wage a person could command in the labor market. This market wage would depend on her education, experience, age, etc. But if the wage offered by the market is too low, meaning that the market wage is lower than her reservation wage, she will not work at all and her potential market wage is not observed

    • This is clearly a self-selection mechanism, and a first stage decision determines whether the wage for this particular person is observed or not. The sample is not random any more and, in particular, is linked to the unobserved reservation wage

    • People with low potential market wage or high reservation wage might therefore be heavily underrepresented in the sample

      • And if the factors deciding labor force participation also influence the potential market wage, we will get biased results regarding the factors determining the market wage
  • A solution to this problem is to define a separate model describing the decision for working versus non-working to get a clue whether a particular person will probably work and try to make an adjustment for the underrepresented cases

  • Hence, we will have two models, a selection or participation model and the outcome model (or sometimes called intensity model) which we are primarily interested in


10.5.1 Heckit model

The selection model (participation model) can be modeled as a simple Probit model; we want to estimate the conditions under which a person is expected to work (or, more generally, to participate). This is the case when the probability of working is larger than that of non-working. Naturally, this leads to a probit model:

s_i^*=\mathbf z_i \boldsymbol \gamma + e_i, \ \ \ \ s_i = \begin{cases} 1 \ \text{ if } \ s_i^*>0 \\ 0 \ \text{ if } \ s_i^* ≤ 0 \end{cases} \tag{10.21}

  • Here, s_i^* is a latent variable, which in our example could be the difference between potential market wage and reservation wage

  • We predict labor market participation, s_i=1, if s_i^*>0, i.e., the market wage is larger than the reservation wage. Clearly, we predict out of labor force, s_i=0, if s_i^*≤0. In particular, s_i=1 if

e_i > - \mathbf z_i \boldsymbol \gamma \tag{10.22}

  • The vector \mathbf z_i contains the explanatory variables for participation in the labor market (in our example the number of kids, income of husband, age, etc.), \boldsymbol \gamma is the corresponding parameter vector to be estimated and e_i represents unobserved heterogeneity of the utility functions and especially, information about the unobserved reservation wage

The equation of primary interest is the outcome equation (sometimes called intensity equation), in our example the wage equation:

y_i^* = \mathbf x_i \boldsymbol \beta + u_i, \ \ \ y_i = \begin{cases} y_i^* \ \text{ if } \ s_i=1 \\ \text{unobserved if } \ s_i = 0 \end{cases} \tag{10.23}

  • Here, y_i^* is the latent dependent variable, in our example the potential market wage, \mathbf x_i are the explanatory variables and u_i represents unobserved heterogeneity of the individuals, which also affects the market wage

  • The observation rule says that y_i is only observed if s_i=1, i.e. s_i^*>0, which means participation in the labor market in our example. The value of s_i is determined in the first stage decision problem, the selection equation

  • Note, we assume that \mathbf z_i is observed for all i in the sample, as otherwise we couldn’t estimate the selection problem

  • According to the selection model (Equation 10.21), besides \mathbf z_i, the unobserved error e_i determines whether the market wage for person i is observed or not. Therefore, we have an incidental sample selection because these factors are not directly linked to y_i or u_i


We assume that the errors from the selection and outcome equations, e and u, are jointly normally distributed with correlation \rho. This means that the joint distribution is solely described by the mean vector and the covariance matrix and thus rules out any nonlinear relationships between e and u:

\binom{e}{u} \ \sim \ N\left ( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \, \sigma_u \\ \rho \, \sigma_u & \sigma_u^2 \end{bmatrix} \right ) \tag{10.24}

  • Note, \sigma_e^2 = 1 is no restriction of generality as this variance is not identified in a Probit model and thus can be arbitrarily normalized (we already discussed this point in the chapter on binary choice models)

  • We assume, that e_i as well as u_i are independent of (\mathbf x_i, \mathbf z_i), thus E(u_i | \mathbf x_i, \mathbf z_i)=0 and E(e_i | \mathbf x_i, \mathbf z_i)=0, MLR.4 (see Section 2.7)

  • Whether or not OLS applied to the wage equation leads to consistent and asymptotically normally distributed estimates of \boldsymbol \beta depends on whether the selection mechanism via e_i is exogenous (uncorrelated) with respect to the error term u_i (and thus with y_i)

  • Therefore, the correlation coefficient \rho is crucial for this problem. If \rho \neq 0, a sample selection bias arises


To show the importance of \rho we decompose u_i into its conditional expectation on e_i and an error term \epsilon_i, which is uncorrelated with e_i by construction and uncorrelated with \mathbf x_i, \mathbf z_i by assumption (as u_i is independent of (\mathbf x_i, \mathbf z_i))

u_i \, = \, E(u_i|e_i) + \epsilon_i, \quad \text{with } \, E(\epsilon_i|\mathbf x_i, \mathbf z_i, e_i)=0

  • Because (u_i,e_i) are jointly normal, E(u_i|e_i) is equal to the linear projection (regression) of u_i on e_i, with the projection coefficient \sigma_{u,e}/\sigma_e^2 (with \sigma_{u,e} = covariance of u and e)

    • This reduces to \sigma_{u,e}, as \sigma_e^2 is normalized to one. Considering the definition of the correlation coefficient, \rho := \sigma_{u,e}/(\sigma_u \sigma_e), we get

u_i \, = \, \sigma_{u,e}e_i + \epsilon_i \, = \, \rho \, \sigma_u e_i + \epsilon_i

y_i^* = \mathbf x_i \boldsymbol \beta + \rho \, \sigma_u e_i + \epsilon_i \tag{10.25}

  • Taking the conditional expectation of Equation 10.25 for the observed sample (s_i=1 \ \Rightarrow \ e_i>-\mathbf z_i \boldsymbol \gamma \,) we get (note, e_i \perp (\mathbf x_i, \mathbf z_i) by assumption)

E(y_i^*|\mathbf x_i, \mathbf z_i, s_i=1) \ = \ \mathbf x_i \boldsymbol \beta \, + \, \rho \sigma_u E(e_i|s_i=1) \, + \, \underbrace {E(\epsilon_i|\mathbf x_i, \mathbf z_i, \overbrace {s_i=1}^{e_i>-\mathbf z_i \boldsymbol \gamma})}_{= \,0, \text{ as } \epsilon_i \, \perp \, (\mathbf x_i, \mathbf z_i,e_i) }


According to the selection model (Equation 10.21), the second term on the right-hand side is:

\rho \sigma_u E(e_i|s_i=1) \, = \, \rho \sigma_u E(e_i|s_i^*>0) \, = \, \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma)

  • But this is the expectation of a truncated standard normal distribution and hence, we can apply the theorem about truncated normal distributions, Theorem 10.1 (with y=e, \mu=0, c=-\mathbf z_i \boldsymbol \gamma and \sigma_{(e)}=1)

    \rho \sigma_u E(e_i|e_i>-\mathbf z_i \boldsymbol \gamma) \, = \, \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma)

    with \lambda(\mathbf z_i \boldsymbol \gamma) being the inverse Mills Ratio (see definition in Equation 10.7)

Putting things together, we finally arrive at the conditional expectation of the observed y_i, which is the population regression in this case:

E(y_i|\mathbf x_i, \mathbf z_i, s_i=1) \, = \, \mathbf x_i \boldsymbol \beta + \rho \sigma_u \lambda(\mathbf z_i \boldsymbol \gamma) \tag{10.26}

  • Note the similarity to Equation 10.8, which is not accidental, as this too is a case of a truncated regression problem

Regarding our leading example: if the correlation of unobserved wage determinants and unobserved determinants of the work decision is positive (\rho>0), the women observed working will have higher wages than expected solely from their characteristics (= positive selection). This effect is captured by \lambda(\mathbf z_i \boldsymbol \gamma), which is strictly positive

  • Furthermore, the correction factor \lambda(\mathbf z_i \boldsymbol \gamma) will be higher for observations with a low probability of being in the sample (because of a low \mathbf z_i \boldsymbol \gamma and \lambda'< 0), but which are nonetheless in the sample because of a high e_i, which represents unobserved heterogeneity and is possibly correlated with the error u_i (the shape of \lambda is illustrated in the short numerical check below)
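The claimed shape of the correction factor is easy to verify numerically. The snippet below is a simple stand-alone illustration (not part of the estimation example) that evaluates \lambda(x)=\phi(x)/\Phi(x) over a grid:

Code
# Inverse Mills ratio lambda(x) = phi(x)/Phi(x): strictly positive and decreasing
lambda <- function(x) dnorm(x) / pnorm(x)
x <- seq(-3, 3, by = 1)
round(cbind(x, lambda = lambda(x)), 3)   # lambda > 0 and falls monotonically in x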

As Equation 10.26 shows, if \rho \neq 0, regressing y_i solely on \mathbf x_i leads to a sample selection bias in the form of an omitted variable bias; the omitted variable \lambda(\mathbf z_i \boldsymbol \gamma) is a nonlinear function of \mathbf z_i and therefore probably correlated with \mathbf x_i (as \mathbf x_i is usually a part of \mathbf z_i)

  • In this case, there are unobserved factors affecting both the selection and the outcome equation. This is the cause of this type of sample selection bias

Finally, let us discuss another aspect more closely by means of our example. We argued that s_i^* in the selection equation could be interpreted as the difference between market wage offers and the reservation wage, (w_i - w_i^r). But according to the outcome Equation 10.23, the market wage offers depend on \mathbf x_i and on u_i. This implies two things

  1. The variables of the outcome equation, \mathbf x_i, obviously play some role in the selection equation (via the market wage offers). Therefore, \mathbf x_i should be part of the variables in \mathbf z_i, the variables of the selection equation. This inclusion of all \mathbf x in \mathbf z is generally recommended as standard procedure

  2. A very similar argument can be put forward for the unobserved factors of the outcome equation, u_i. These too influence the market wage offers and therefore the decision to work or not. Thus, these factors are also a part of e_i, the unobserved factors of the selection equation. Hence, the error terms u_i and e_i are expected to be positively correlated in our example, which implies \rho > 0


Estimation

The model introduced above is called the Heckman model or Heckit, Heckman (1979). It is a generalization of the Tobit model. In fact, if we set \mathbf x_i = \mathbf z_i and u_i = e_i and censor y_i at zero for the cases in which y_i is not observed, we arrive at the Tobit model. For this reason, the Heckman model is also sometimes called the Tobit II model.

  • There are two procedures to estimate the model, a 2-step procedure and maximum likelihood estimation. Surprisingly, the 2-step procedure is much more common in practice. This procedure is as follows

    1. Estimate the selection model (Equation 10.21) by probit. This determines if the dependent variable is expected to be observed. The probability for y_i to be observed is

    P(s_i=1|\mathbf z_i) = \Phi(\mathbf z_i \boldsymbol \gamma) \ \rightarrow \ \hat {\boldsymbol \gamma}

    2. Calculate the inverse Mills ratio using the estimated probit coefficients for every observation

    \hat\lambda({\mathbf z}_i \hat{\boldsymbol \gamma}) = \dfrac {\phi({\mathbf z}_i \hat{\boldsymbol \gamma})}{\Phi({\mathbf z}_i \hat{\boldsymbol \gamma})}

    3. Estimate the outcome Equation 10.23, augmented by the estimated inverse Mills ratio as a correction factor, by OLS. This yields consistent estimates of \boldsymbol \beta. The coefficient \alpha is an estimate of \rho \sigma_u from Equation 10.26

    y_i \, = \, \mathbf x_i \boldsymbol \beta \, + \, \alpha \, \hat\lambda(\mathbf z_i \hat{\boldsymbol \gamma}) \, + \, error \tag{10.27}

    4. A complication arises with the calculation of correct standard errors for regression Equation 10.27. Besides an inherent heteroscedasticity of the error, one has to take into account that \hat \lambda is itself estimated with the help of the estimated parameters \hat {\boldsymbol \gamma} of the selection equation, which introduces additional variation and correlations across observations. For details see Green (2017), chapt. 19, p. 994. A minimal manual sketch of these steps follows below
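The sketch below carries out the estimation steps by hand with the mroz data used in the example further down (the resulting OLS standard errors are not corrected for the estimated \hat \lambda, so it serves only to make the mechanics explicit):

Code
library(wooldridge)
data("mroz")

# Step 1: Probit selection equation for labor force participation
probit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
              family = binomial(link = "probit"), data = mroz)

# Step 2: inverse Mills ratio, evaluated at the estimated index z_i * gamma-hat
zg       <- predict(probit, type = "link")
mroz$imr <- dnorm(zg) / pnorm(zg)

# Step 3: outcome equation augmented by the inverse Mills ratio, OLS on the selected sample
out2step <- lm(log(wage) ~ educ + exper + I(exper^2) + imr, data = mroz, subset = inlf == 1)
summary(out2step)   # coefficient of imr estimates rho * sigma_u; standard errors not corrected

The point estimates of the outcome equation should coincide with the 2-step heckit output reported below; only the reported standard errors differ, as heckit applies the correction mentioned in step 4.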

The whole procedure sounds a little bit complicated, but there are standard routines for both STATA and R. The corresponding R package is sampleSelection with the procedure heckit. These routines also calculate the correct standard errors for the 2-step procedure.

  • The estimated coefficient \hat \alpha of \hat \lambda in Equation 10.27 can be used to test for a sample selection bias. If this parameter is not statistically significantly different from zero, there seems to be no sample selection bias, as \hat \alpha is an estimate of \rho \sigma_u; hence, \hat \alpha = 0 \Rightarrow \rho = 0

  • However, this whole 2-step procedure and testing for a sample selection bias should be carried out with great care. The reason is that there is an inherent identification problem; the variable \lambda is a nonlinear function (but over a wide range nearly a linear function) of the variables in \mathbf z (see the short numerical check after this list). As we learned above, \mathbf x is generally contained in \mathbf z. Thus, we often have a severe multicollinearity problem in the estimated outcome Equation 10.27, leading to very imprecise parameter estimates and tests with little power

  • To reduce the identification problem, we should always have variables in \mathbf z which are not contained in \mathbf x; these help to identify the coefficients of \mathbf x in Equation 10.27, because \hat \lambda is then no longer a nearly linear combination of the variables in \mathbf x

  • In practice, it is often hard to find variables that play a role in the selection model but credibly do not play a role in the outcome model. Erroneously omitting variables from the outcome model can lead to severe biases
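The near-linearity of \lambda mentioned above can be checked directly. The snippet below is only an illustration over an arbitrarily chosen range of the probit index:

Code
# Over a moderate range of the index z * gamma, lambda() is close to a straight line,
# which is the source of the multicollinearity problem without exclusion restrictions
idx <- seq(-1.5, 1.5, by = 0.01)   # arbitrarily chosen range of the index
lam <- dnorm(idx) / pnorm(idx)
summary(lm(lam ~ idx))$r.squared   # close to 1: lambda is nearly linear over this range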


Example

We argued that \mathbf z should contain at least one variable which is not contained in \mathbf x. Let us discuss this aspect by means of our example, which is estimated below.

  • We are interested in the market wage offers of married women. We assume that education and experience play an important role in this respect. So, this is our outcome equation

  • To avoid a sample selection problem (because we only observe wages of employed women), we apply Heckman’s 2-step procedure. Thus, we additionally specify a selection equation for the decision to work, to estimate the correction factor \lambda. We argued above that all variables of the outcome equation should be included in the selection equation, because potential wage offers are clearly an important factor for the decision to work as well

  • Therefore, the selection model contains at least education and experience as explanatory variables. However, to better identify the parameters of the outcome equation, we additionally need variables that are important for the decision to work and do not play a role for the market wage offers

    • In this case, we find such variables; we added the non-wife income (income of the husband), the number of children under six, the number of children six or older, and age as explanatory variables to the selection model. These variables should have an effect on the reservation wage and therefore on the decision to work, but not on the wage offers in the market

library(sampleSelection)
library(wooldridge)
data("mroz")

# OLS estimation
ols <-  lm(log(wage) ~ educ + exper + I(exper^2), data = mroz)

# Heckit 2 Step model (default); the first line is the selection equation, 
# the second line the outcome equation
heck   <-  heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                log(wage) ~ educ + exper + I(exper^2),
                data = mroz) 
                
summary (heck)
      --------------------------------------------
      Tobit 2 model (sample selection model)
      2-step Heckman / heckit estimation
      753 observations (325 censored and 428 observed)
      15 free parameters (df = 739)
      Probit selection equation:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  0.270077   0.508593   0.531  0.59556    
      nwifeinc    -0.012024   0.004840  -2.484  0.01320 *  
      educ         0.130905   0.025254   5.183 2.81e-07 ***
      exper        0.123348   0.018716   6.590 8.34e-11 ***
      I(exper^2)  -0.001887   0.000600  -3.145  0.00173 ** 
      age         -0.052853   0.008477  -6.235 7.61e-10 ***
      kidslt6     -0.868328   0.118522  -7.326 6.21e-13 ***
      kidsge6      0.036005   0.043477   0.828  0.40786    
      Outcome equation:
                    Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -0.5781032  0.3050062  -1.895  0.05843 .  
      educ         0.1090655  0.0155230   7.026 4.83e-12 ***
      exper        0.0438873  0.0162611   2.699  0.00712 ** 
      I(exper^2)  -0.0008591  0.0004389  -1.957  0.05068 .  
      Multiple R-Squared:0.1569,    Adjusted R-Squared:0.149
         Error terms:
                    Estimate Std. Error t value Pr(>|t|)
      invMillsRatio  0.03226    0.13362   0.241    0.809
      sigma          0.66363         NA      NA       NA
      rho            0.04861         NA      NA       NA
      --------------------------------------------
# Heckit MLE model
heckml <-  heckit(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                log(wage) ~ educ + exper + I(exper^2),
                data = mroz, method = "ML") 

Code
library(texreg)
screenreg( list(ols, heck, heckml),  digits = 4, custom.model.names = c("OLS", "Heckit 2Stp", "Heckit MLE"), 
           prefix = FALSE, include.selection = FALSE, include.rsquared=FALSE ) 
      
      =========================================================
                      OLS           Heckit 2Stp   Heckit MLE   
      ---------------------------------------------------------
      (Intercept)      -0.5220 **    -0.5781        -0.5527 *  
                       (0.1986)      (0.3050)       (0.2604)   
      educ              0.1075 ***    0.1091 ***     0.1084 ***
                       (0.0141)      (0.0155)       (0.0149)   
      exper             0.0416 **     0.0439 **      0.0428 ** 
                       (0.0132)      (0.0163)       (0.0149)   
      exper^2          -0.0008 *     -0.0009        -0.0008 *  
                       (0.0004)      (0.0004)       (0.0004)   
      invMillsRatio                   0.0323                   
                                     (0.1336)                  
      sigma                           0.6636         0.6634 ***
                                                    (0.0227)   
      rho                             0.0486         0.0266    
                                                    (0.1471)   
      ---------------------------------------------------------
      Adj. R^2          0.1509        0.1490                   
      Num. obs.       428           753            753         
      Censored                      325            325         
      Observed                      428            428         
      AIC                                         1693.7702    
      BIC                                         1758.5071    
      Log Likelihood                              -832.8851    
      =========================================================
      *** p < 0.001; ** p < 0.01; * p < 0.05

  • In our example, the coefficient of \lambda, the inverse Mills ratio, is not significant. Thus, there is no evidence of a sample selection bias, which is why the estimates of OLS and Heckit are very similar

  • In the table, the results of a maximum likelihood estimation are also reported. As mentioned above, the 2-step procedure is much more common in empirical work, although the MLE is more efficient, especially with regard to an estimate of \rho. The downside of MLE is that it is said to be more prone to specification errors, for instance with regard to the normality assumption for the error terms

  • The likelihood function of the Heckman model consists of two parts: P(s_i=0), which is the probit part, and the density of y_i given s_i=1, times P(s_i=1)

    \ln L = \sum_{s_i=0} \ln (1 - \Phi(\mathbf z_i \boldsymbol \gamma)) + \sum_{s_i=1} \ln f(y_i|s_i=1) \Phi(\mathbf z_i \boldsymbol \gamma)

    As we have an incidental truncation, f(y_i|s_i=1) = f(y_i|e_i>-\mathbf z_i \boldsymbol \gamma) is a more complicated term based on a bivariate truncated normal distribution (see Green (2017), chapt. 19, p. 995); a sketch of the resulting log-likelihood in R is given after this list

  • Last but not least, it should be noted that many of the methods for limited dependent variables and sample selection discussed here can be extended to 2SLS and some also to panel data (for details see Green (2017))
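For completeness, the two-part structure of the log-likelihood can be coded directly. The function below is only a sketch under the standard parameterization with \sigma_e = 1; it is not the internal implementation of sampleSelection, and all names are illustrative.

Code
# Sketch of the Heckman log-likelihood (standard parameterization, illustrative only)
heckman_loglik <- function(gamma, beta, sigma, rho, y, X, Z, s) {
  zg  <- drop(Z %*% gamma)                          # selection index z_i * gamma
  ll0 <- pnorm(-zg[s == 0], log.p = TRUE)           # log(1 - Phi(z_i * gamma)) for s_i = 0
  u   <- y[s == 1] - drop(X[s == 1, , drop = FALSE] %*% beta)
  a   <- (zg[s == 1] + rho * u / sigma) / sqrt(1 - rho^2)
  ll1 <- pnorm(a, log.p = TRUE) +                   # selection part, given y_i
         dnorm(u / sigma, log = TRUE) - log(sigma)  # log density of y_i
  sum(ll0) + sum(ll1)
}

Maximizing this function over (\boldsymbol \gamma, \boldsymbol \beta, \sigma, \rho), e.g. with optim() and a reparameterization that keeps \sigma>0 and |\rho|<1, corresponds to the ML estimation carried out above with heckit(..., method = "ML").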


  1. For a truncation from the right (from above) we have: \lambda(x) =\dfrac {-\phi(x)}{\Phi(-x)}.↩︎

  2. Hint: Let \alpha = (\frac{\mathbf x_i \boldsymbol \beta - c_i}{\sigma}). Then \lambda'(\alpha)=\frac {-\alpha \phi(\alpha) \Phi(\alpha) - \phi(\alpha)^2}{\Phi(\alpha)^2} = -\alpha \lambda(\alpha) - \lambda(\alpha)^2 = (-) and one can show that: -1<\lambda'(\alpha)<0.↩︎

  3. The following results carry over to IV-estimations as well, if E(u|\mathbf z, s)=0, with \mathbf z being the (external + internal) instruments.↩︎