2021-10-29
We say that there is endogeneity in the linear regression model if \(\mathbb E[x_i \varepsilon_i] \neq 0\).
The random vector \(z_i\) is an instrumental variable in the linear regression model if the following conditions are met.
Let \(K = dim(x_i)\) and \(L = dim(z_i)\). We say that the model is just-identified if \(L = K\) (method: IV) and over-identified if \(L > K\) (method: 2SLS).
Assume \(z_i\) satisfies the instrumental variable assumptions above and \(dim(z_i) = dim(x_i)\); then the instrumental variables (IV) estimator \(\hat{\beta} _ {IV}\) is given by \[ \begin{aligned} \hat{\beta} _ {IV} &= \mathbb E_n[z_i x_i']^{-1} \mathbb E_n[z_i y_i] = \newline &= \left( \frac{1}{n} \sum _ {i=1}^n z_i x_i'\right)^{-1} \left( \frac{1}{n} \sum _ {i=1}^n z_i y_i\right) = \newline &= (Z'X)^{-1} (Z'y) \end{aligned} \]
Assume \(z_i\) satisfies the instrumental variable assumptions above and \(dim(z_i) > dim(x_i)\); then the two-stage least squares (2SLS) estimator \(\hat{\beta} _ {2SLS}\) is given by \[ \hat{\beta} _ {2SLS} = \Big( X'Z (Z'Z)^{-1} Z'X \Big)^{-1} \Big( X'Z (Z'Z)^{-1} Z'y \Big) \] Here \(\hat{x}_i\) denotes the predicted value of \(x_i\) from the first-stage regression of \(x_i\) on \(z_i\); as shown below, the 2SLS estimator is equivalent to the IV estimator that uses \(\hat{x}_i\) as an instrument for \(x_i\).
The estimator is called two-stage least squares since it can be rewritten as an IV estimator that uses \(\hat{X}\) as an instrument: \[ \begin{aligned} \hat{\beta} _ {\text{2SLS}} &= \Big( X'Z (Z'Z)^{-1} Z'X \Big)^{-1} \Big( X'Z (Z'Z)^{-1} Z'y \Big) = \newline &= (\hat{X}' X)^{-1} \hat{X}' y = \newline &= \mathbb E_n[\hat{x}_i x_i']^{-1} \mathbb E_n[\hat{x}_i y_i] \end{aligned} \]
Moreover it can be rewritten as \[ \begin{aligned} \hat{\beta} _ {\text{2SLS}} &= (\hat{X}' X)^{-1} \hat{X}' y = \newline &= (X' P_Z X)^{-1} X' P_Z y = \newline &= (X' P_Z P_Z X)^{-1} X' P_Z y = \newline &= (\hat{X}' \hat{X})^{-1} \hat{X}' y = \newline &= \mathbb E_n [\hat{x}_i \hat{x}_i']^{-1} \mathbb E_n[\hat{x}_i y_i] \end{aligned} \] where the third equality uses the idempotency of the projection matrix \(P_Z\).
How to test the relevance condition? Rule of thumb: the \(F\)-statistic for the joint significance of \(z_i\) in the first stage should be larger than \(10\).
Problem: as \(n \to \infty\) with \(L\) fixed, the first-stage \(F\) statistic diverges to infinity whenever the instruments have any explanatory power, so a fixed cutoff such as \(10\) is a bad rule of thumb.
Theorem
If \(K=L\), \(\hat{\beta} _ {\text{2SLS}} = \hat{\beta} _ {\text{IV}}\).
Proof
If \(K=L\), \(X'Z\) and \(Z'X\) are square matrices and, by the relevance condition, non-singular (invertible). \[ \begin{aligned} \hat{\beta} _ {\text{2SLS}} &= \Big( X'Z (Z'Z)^{-1} Z'X \Big)^{-1} \Big( X'Z (Z'Z)^{-1} Z'y \Big) = \newline &= (Z'X)^{-1} (Z'Z) (X'Z)^{-1} X'Z (Z'Z)^{-1} Z'y = \newline &= (Z'X)^{-1} (Z'Z) (Z'Z)^{-1} Z'y = \newline &= (Z'X)^{-1} (Z'y) = \newline &= \hat{\beta} _ {\text{IV}} \end{aligned} \] \[\tag*{$\blacksquare$}\]
Example from Hayashi (2000), page 187: demand and supply simultaneous equations. \[ \begin{aligned} & q_i^D(p_i) = \alpha_0 + \alpha_1 p_i + u_i \newline & q_i^S(p_i) = \beta_0 + \beta_1 p_i + v_i \end{aligned} \]
We have an endogeneity problem. To see why, we solve the system of equations for \((p_i, q_i)\): \[ \begin{aligned} & p_i = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{v_i - u_i}{\alpha_1 - \beta_1 } \newline & q_i = \frac{\alpha_1\beta_0 - \alpha_0 \beta_1}{\alpha_1 - \beta_1} + \frac{\alpha_1 v_i - \beta_1 u_i}{\alpha_1 - \beta_1 } \end{aligned} \]
Then the price variable is not independent of the error term in either equation (assuming \(u_i\) and \(v_i\) are uncorrelated): \[ \begin{aligned} & Cov(p_i, u_i) = - \frac{Var(u_i)}{\alpha_1 - \beta_1 } \newline & Cov(p_i, v_i) = \frac{Var(v_i)}{\alpha_1 - \beta_1 } \end{aligned} \]
As a consequence, the OLS estimators are not consistent: \[ \begin{aligned} & \hat{\alpha} _ {1, OLS} \overset{p}{\to} \alpha_1 + \frac{Cov(p_i, u_i)}{Var(p_i)} \newline & \hat{\beta} _ {1, OLS} \overset{p}{\to} \beta_1 + \frac{Cov(p_i, v_i)}{Var(p_i)} \end{aligned} \]
In general, regressing \(q\) on \(p\) you estimate \[ \begin{aligned} \hat{\gamma} _ {OLS} &\overset{p}{\to} \frac{Cov(p_i, q_i)}{Var(p_i)} = \newline &= \frac{\alpha_1 Var(v_i) + \beta_1 Var(u_i)}{(\alpha_1 - \beta_1)^2} \left( \frac{Var(v_i) + Var(u_i)}{(\alpha_1 - \beta_1)^2} \right)^{-1} = \newline &= \frac{\alpha_1 Var(v_i) + \beta_1 Var(u_i)}{Var(v_i) + Var(u_i)} \end{aligned} \] which is neither \(\alpha_1\) nor \(\beta_1\) but a variance-weighted average of the two.
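To see this numerically, here is a minimal simulation sketch (with hypothetical parameter values) that compares the OLS slope of \(q\) on \(p\) with the variance-weighted average above.

```julia
# Simulation sketch: plim of OLS in the supply-demand system (hypothetical parameters)
using Random, Statistics

Random.seed!(1);
n = 1_000_000;
α0, α1 = 10.0, -1.0;   # demand intercept and slope (assumed values)
β0, β1 = 1.0, 2.0;     # supply intercept and slope (assumed values)
σu, σv = 1.0, 2.0;     # standard deviations of the demand and supply shocks

u = σu * randn(n);
v = σv * randn(n);

# Equilibrium price and quantity from the reduced form above
p = (β0 - α0) / (α1 - β1) .+ (v .- u) ./ (α1 - β1);
q = (α1*β0 - α0*β1) / (α1 - β1) .+ (α1 .* v .- β1 .* u) ./ (α1 - β1);

# OLS slope of q on p vs. the variance-weighted average
cov(p, q) / var(p), (α1*σv^2 + β1*σu^2) / (σv^2 + σu^2)
```

Both numbers should be approximately \(-0.4\) with these parameter values.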
Suppose we have a supply shifter \(z_i\) such that
We combine the second condition and \(\mathbb E[u_i] = 0\) to get a system of 2 equations in 2 unknowns: \(\alpha_0\) and \(\alpha_1\). \[ \begin{aligned} & \mathbb E[z_i u_i] = \mathbb E[ z_i (q_i^D(p_i) - \alpha_0 - \alpha_1 p_i) ] = 0 \newline & \mathbb E[u_i] = \mathbb E[q_i^D(p_i) - \alpha_0 - \alpha_1 p_i] = 0 \end{aligned} \]
We could try to solve for the vector \(\alpha = (\alpha_0, \alpha_1)'\) that satisfies the sample analogues of these conditions, stacking the constant into both the regressor vector \(x_i = (1, p_i)'\) and the instrument vector: \[ \begin{aligned} & \mathbb E_n[z_i (q_i^D - x_i'\alpha)] = 0 \newline & \mathbb E_n[z_i q_i^D] - \mathbb E_n[z_ix_i']\alpha = 0 \end{aligned} \]
If \(\mathbb E_n[z_ix_i']\) is invertible, we get \(\hat{\alpha} = \mathbb E_n[z_ix_i']^{-1} \mathbb E_n[z_i q^D_i]\), which is indeed the IV estimator of \(\alpha\) using \(z_i\) as an instrument for the endogenous variable \(p_i\).
This code draws \(n = 100\) observations from the model \(y_i = 2 x _{i1} - x _{i2} + \varepsilon_i\), where the regressors are endogenous: \(x_i = \gamma' z_i + u_i\) with instruments \(z_i \sim U[0,1]^3\), and the first-stage error \(u_i\) and the structural error \(\varepsilon_i\) are standard normal with correlation \(0.8\).
```julia
# Packages used in this and the following code chunks
using Random, Distributions, LinearAlgebra, Statistics, Optim

# Set seed
Random.seed!(123);

# Set the number of observations
n = 100;

# Set the dimension of Z
l = 3;

# True coefficient vector of the model above
β = [2, -1];

# Draw instruments
Z = rand(Uniform(0,1), n, l);

# Correlation matrix for error terms
S = [1 0.8; 0.8 1];

# Endogenous X
γ = [2 0; 0 -1; -1 3];
ε = rand(Normal(0,1), n, 2) * cholesky(S).U;
X = Z*γ .+ ε[:,1];

# Calculate y
y = X*β .+ ε[:,2];
```
```julia
# Estimate beta OLS
β_OLS = (X'*X)\(X'*y)
```

```
## 2-element Array{Float64,1}:
##  2.335699233358403
## -0.8576266209987325
```
```julia
# IV: l=k=2 instruments
k = 2;
Z_IV = Z[:,1:k];
β_IV = (Z_IV'*X)\(Z_IV'*y)
```

```
## 2-element Array{Float64,1}:
##  1.6133344277861439
## -0.6678537395714547
```
```julia
# Calculate standard errors (homoskedastic and HC0 variance estimates of β_IV)
ε_hat = y - X*β_IV;
V_NHC_IV = var(ε_hat) * inv(Z_IV'*X) * (Z_IV'*Z_IV) * inv(X'*Z_IV);
V_HC0_IV = inv(Z_IV'*X) * Z_IV' * (I(n) .* ε_hat.^2) * Z_IV * inv(X'*Z_IV);
```
```julia
# 2SLS: l=3 instruments
Pz = Z*inv(Z'*Z)*Z';
β_2SLS = (X'*Pz*X)\(X'*Pz*y)
```

```
## 2-element Array{Float64,1}:
##  1.904553638377971
## -0.8810907510370429
```
```julia
# Calculate standard errors
ε_hat = y - X*β_2SLS;
V_NHC_2SLS = var(ε_hat) * inv(X'*Pz*X);
V_HC0_2SLS = inv(X'*Pz*X) * X'*Pz * (I(n) .* ε_hat.^2) * Pz*X * inv(X'*Pz*X);
```
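Going back to the relevance rule of thumb above, a minimal sketch of the (homoskedastic) first-stage \(F\) statistic for the joint significance of the instruments, here for the first endogenous regressor of the simulated data:

```julia
# First stage: regress the first endogenous regressor on all instruments
x1 = X[:,1];
γ_hat = (Z'*Z)\(Z'*x1);
u_hat = x1 - Z*γ_hat;
σ2_hat = (u_hat'*u_hat) / (n - l);

# Homoskedastic Wald/F statistic for H0: all first-stage coefficients are zero
# (the simulated model has no constant, so all l coefficients are tested)
F_first_stage = (γ_hat' * (Z'*Z) * γ_hat) / (l * σ2_hat)
```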
We have a system of \(L\) moment conditions \[ \begin{aligned} & \mathbb E[g_1(\omega_i, \delta_0)] = 0 \newline & \vdots \newline & \mathbb E[g_L(\omega_i, \delta_0)] = 0 \end{aligned} \]
If \(L = \dim (\delta_0)\), the system is exactly identified and can in general be solved exactly. If \(L > \dim (\delta_0)\), there may be no solution to the system of equations.
There are two possibilities.
The choices of \(A\) and \(W\) are closely related to each other.
Since \(J(\delta,W)\) is a quadratic form, a closed form solution exists: \[ \hat{\delta}(W) = \Big(\mathbb E_n[z_i x_i']' W \mathbb E_n[z_i x_i'] \Big)^{-1}\mathbb E_n[z_i x_i']' W \mathbb E_n[z_i y_i] \]
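As a sketch, the closed-form solution can be computed directly from the sample moments of the simulated data above, for an arbitrary choice of \(W\) (the identity here, purely for illustration):

```julia
# Closed-form linear GMM for a generic weighting matrix W (identity for illustration)
Qzx = Z'*X / n;    # E_n[z_i x_i']
Qzy = Z'*y / n;    # E_n[z_i y_i]
W   = I(l);
δ_hat_W = (Qzx' * W * Qzx) \ (Qzx' * W * Qzy)
```

With \(W = I\) this coincides, up to optimizer tolerance, with the 1-step GMM estimate `β_gmm_1` computed below.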
Assumptions for consistency of the GMM estimator given data \(\mathcal D = \lbrace y_i, x_i, z_i \rbrace _ {i=1}^n\):
Theorem
Under linearity, independence, orthogonality and rank conditions, if \(\hat{W} \overset{p}{\to} W\) positive definite, then \[ \hat{\delta}(\hat{W}) \overset{p}{\to} \delta_0 \] If, in addition to the above assumptions, \(\sqrt{n} \mathbb E_n [g(\omega_i, \delta_0)] \overset{d}{\to} N(0,S)\) for a fixed positive definite \(S\), then \[ \sqrt{n} (\hat{\delta} (\hat{W}) - \delta_0) \overset{d}{\to} N(0,V) \] where \(V = (\Sigma' _ {xz} W \Sigma _ {xz})^{-1} \Sigma' _ {xz} W S W \Sigma _ {xz}(\Sigma' _ {xz} W \Sigma _ {xz})^{-1}\).
Finally, if a consistent estimator \(\hat{S}\) of \(S\) is available, then using sample analogues \(\hat{\Sigma}_{xz}\) it follows that \[ \hat{V} \overset{p}{\to} V \]
If \(W = S^{-1}\) then \(V\) reduces to \(V = (\Sigma' _ {xz} S^{-1} \Sigma _ {xz})^{-1}\). Moreover, this is the smallest possible asymptotic variance, in the sense that for any other choice of \(W\) the difference between the two variance matrices is positive semi-definite.
Therefore, to have an efficient estimator, you want to construct \(\hat{W}\) such that \(\hat{W} \overset{p}{\to} S^{-1}\).
Estimation steps:
On the procedure:
- This estimator achieves the semiparametric efficiency bound.
- This strategy works only if a consistent estimator \(\hat{S} \overset{p}{\to} S\) is available.
- For iid cases: we can use \(\hat{S} = \mathbb E_n[(\hat{\varepsilon}_i z_i)(\hat{\varepsilon}_i z_i) ' ] = \mathbb E_n[z_i z_i' \hat{\varepsilon}_i^2]\) where \(\hat{\varepsilon}_i = y_i - x_i' \hat{\delta}(\hat{W} _ {init})\).
```julia
# GMM 1-step: inefficient weighting matrix
W_1 = I(l);

# Objective function
gmm_1(b) = ( y - X*b )' * Z * W_1 * Z' * ( y - X*b );

# Estimate GMM
β_gmm_1 = optimize(gmm_1, β_OLS).minimizer
```

```
## 2-element Array{Float64,1}:
##  1.91556882526808
## -0.8769689391885799
```
```julia
# Standard errors GMM
ε_hat = y - X*β_gmm_1;
S_hat = Z' * (I(n) .* ε_hat.^2) * Z;
d_hat = -X'*Z;
# Note: inv(d*inv(S)*d') is the variance formula of the *efficient* GMM estimator;
# for W_1 = I the sandwich form from the theorem above would be the exact asymptotic variance
V_gmm_1 = inv(d_hat * inv(S_hat) * d_hat')
```

```
## 2×2 Array{Float64,2}:
##  0.0158497   -0.00346601
## -0.00346601   0.00616531
```
```julia
# GMM 2-step: efficient weighting matrix
W_2 = inv(S_hat);

# Objective function
gmm_2(b) = ( y - X*b )' * Z * W_2 * Z' * ( y - X*b );

# Estimate GMM
β_gmm_2 = optimize(gmm_2, β_OLS).minimizer
```

```
## 2-element Array{Float64,1}:
##  1.905326742963115
## -0.881808949213345
```
```julia
# Standard errors GMM
ε_hat = y - X*β_gmm_2;
S_hat = Z' * (I(n) .* ε_hat.^2) * Z;
d_hat = -X'*Z;
V_gmm_2 = inv(d_hat * inv(S_hat) * d_hat')
```

```
## 2×2 Array{Float64,2}:
##  0.0162603   -0.00357632
## -0.00357632   0.00631259
```
If the equations are exactly identified, then it is possible to choose \(\delta\) so that all the elements of the sample moments \(\mathbb E_n[g(\omega_i; \delta)]\) are zero and thus that the distance \[ J(\delta, \hat{W}) = n \mathbb E_n[g(\omega_i, \delta)]' \hat{W} \mathbb E_n[g(\omega_i, \delta)] \] is zero. (The \(\delta\) that does it is the IV estimator.)
If the equations are overidentified, i.e. \(L\) (number of instruments) \(> K\) (number of parameters), then the distance cannot in general be made exactly zero, but we would expect the minimized distance to be close to zero.
Suppose your model is overidentified (\(L > K\)) and you use the following naive testing procedure:
If you have two invalid instruments and you use one to test the validity of the other, it might happen by chance that you don’t reject it.
Model: \(y_i = x_i + \varepsilon_i\) and \(x_i = \frac{1}{2} z _{i1} - \frac{1}{2} z _{i2} + u_i\)
Assume \[ Cov (z _{i1}, z _{i2}, \varepsilon_i, u_i) = \begin{bmatrix} 1 & 0 & 0 & 0 \newline 0 & 1 & 0 & 0 \newline 0 & 0 & 1 & 0.5 \newline 0 & 0 & 0.5 & 1 \end{bmatrix} \]
You want to test whether the second instrument is valid (it is not, since \(\mathbb E[z _{i2} \varepsilon_i] \neq 0\)). You use \(z _{i1}\) and estimate \(\hat{\beta}\), treating that estimator as consistent.
You obtain \(\mathbb E_n[z _{i2} \hat{\varepsilon}_i] \simeq 0\) even though \(z _{i2}\) is invalid.
Problem: you are using an invalid instrument in the first place.
Theorem: We are interested in testing \(H_0: \mathbb E[z_i \varepsilon_i] = 0\) against \(H_1: \mathbb E[z_i \varepsilon_i] \neq 0\). Suppose \(\hat{S} \overset{p}{\to} S\). Then \[ J(\hat{\delta}(\hat{S}^{-1}) , \hat{S}^{-1}) \overset{d}{\to} \chi^2 _ {L-K} \] For \(c\) satisfying \(\alpha = 1- G_{L - K} ( c )\), where \(G_{L-K}\) is the \(\chi^2 _ {L-K}\) CDF, we have \(\Pr(J>c | H_0) \to \alpha\), so the test that rejects \(H_0\) when \(J > c\) has asymptotic size \(\alpha\).
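As a sketch, for the simulated data above (\(L = 3\), \(K = 2\), so one overidentifying restriction), the \(J\) statistic can be computed from the two-step GMM residuals:

```julia
# Hansen J test of the overidentifying restrictions
ε_hat = y - X*β_gmm_2;
S_hat = Z' * (I(n) .* ε_hat.^2) * Z;
J = (ε_hat'*Z) * inv(S_hat) * (Z'*ε_hat);
p_value_J = 1 - cdf(Chisq(l - k), J)
```

Up to which residuals are used to form \(\hat{S}\), \(J\) is the same quantity as the minimized two-step objective `gmm_2(β_gmm_2)`.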
The main implication of conditional homoskedasticity is that efficient GMM becomes 2SLS. With efficient GMM estimation, the weighting matrix is \(\hat{S}^{-1} = \mathbb E_n [z_i z_i' \hat{\varepsilon}_i^2]^{-1}\). With conditional homoskedasticity, the efficient weighting matrix is \(\mathbb E_n[z_iz_i']^{-1} \sigma^{-2}\), or equivalently \(\mathbb E_n[z_iz_i']^{-1}\) since the scalar factor does not affect the minimizer. Then, the GMM estimator becomes \[ \hat{\delta}(\hat{S}^{-1}) = \Big(\mathbb E_n[z_i x_i']' \underbrace{\mathbb E_n[z_iz_i']^{-1} \mathbb E_n[z_i x_i']} _ {\text{ols of } x_i \text{ on }z_i} \Big)^{-1}\mathbb E_n[z_i x_i']' \underbrace{\mathbb E_n[z_iz_i']^{-1} \mathbb E_n[z_i y_i]} _ {\text{ols of } y_i \text{ on }z_i}= \hat{\delta} _ {2SLS} \]
Proof: Consider the matrix notation. \[ \begin{aligned} \hat{\delta} \left( \frac{Z'Z}{n}\right) &= \left( \frac{X'Z}{n} \left( \frac{Z'Z}{n}\right)^{-1} \frac{Z'X}{n} \right)^{-1} \frac{X'Z}{n} \left( \frac{Z'Z}{n}\right)^{-1} \frac{Z'Y}{n} = \newline &= \left( X'Z(Z'Z)^{-1} Z'X \right)^{-1} X'Z(Z'Z)^{-1} Z'Y = \newline &= \left(X'P_ZX\right)^{-1} X'P_ZY = \newline &= \left(X'P_ZP_ZX\right)^{-1} X'P_ZY = \newline &= \left(\hat{X}'_Z \hat{X}_Z\right)^{-1} \hat{X}'_ZY = \newline &= \hat{\delta} _ {2SLS} \end{aligned} \] \[\tag*{$\blacksquare$}\]
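A quick numerical check on the simulated data: using \((Z'Z)^{-1}\) as the weighting matrix reproduces the 2SLS estimate.

```julia
# Efficient GMM under conditional homoskedasticity = 2SLS
β_gmm_homosked = (X'*Z * inv(Z'*Z) * Z'*X) \ (X'*Z * inv(Z'*Z) * Z'*y);
β_gmm_homosked ≈ β_2SLS    # should return true
```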
Theorem: When the number of instruments is equal to the sample size (\(L = n\)), then \(\hat{\delta} _ {2SLS} = \hat{\delta} _ {OLS}\)
Proof: We have a perfect prediction problem: with \(L = n\), the matrix \(Z\) is square and (assuming it is non-singular) \(P_Z = Z(Z'Z)^{-1}Z' = I_n\), so the first stage fits the regressors perfectly, \(\hat{X} = P_Z X = X\), i.e. \(\hat{x}_i = x_i\). Then \[ \begin{aligned} \hat{\delta} _ {2SLS} &= \mathbb E_n[\hat{x}_i x'_i]^{-1} \mathbb E_n[\hat{x}_i y_i] = \newline &= \mathbb E_n[x_i x'_i]^{-1} \mathbb E_n[x_i y_i] = \newline &= \hat{\delta} _ {OLS} \end{aligned} \] \[\tag*{$\blacksquare$}\]
You have this overfitting problem in general when the number of instruments is large relative to the sample size. This problem arises even if the instruments are valid.
Why is having too many instruments problematic? As the number of instruments increases, the estimated coefficient gets closer to the OLS estimate, which is inconsistent under endogeneity. As seen in the theorem above, for \(L=n\) the two estimators coincide.
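A minimal sketch of this mechanism on the simulated data: augmenting \(Z\) with many irrelevant instruments (pure noise, so they are valid but useless) pulls the 2SLS estimate toward OLS.

```julia
# Add 90 irrelevant instruments to the 3 original ones (hypothetical choice)
Random.seed!(456);
Z_many = [Z rand(Uniform(0,1), n, 90)];
P_many = Z_many * inv(Z_many'*Z_many) * Z_many';
β_2SLS_many = (X'*P_many*X) \ (X'*P_many*y)
```

With 93 instruments and \(n = 100\), the projection matrix is close to the identity, so the estimate is close to `β_OLS` rather than to the true \(\beta\).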
An alternative method to estimate the parameters of the structural equation is by maximum likelihood. Anderson and Rubin (1949) derived the maximum likelihood estimator for the joint distribution of \((y_i, x_i)\). The estimator is known as limited information maximum likelihood, or LIML.
This estimator is called “limited information” because it is based on the structural equation for \((y_i, x_i)\) combined with the reduced form equation for \(x_i\). If maximum likelihood is derived based on a structural equation for \(x_i\) as well, then this leads to what is known as full information maximum likelihood (FIML). The advantage of the LIML approach relative to FIML is that the former does not require a structural model for \(x_i\), and thus allows the researcher to focus on the structural equation of interest - that for \(y_i\).
The k-class estimators have the form \[ \hat{\delta}(\alpha) = (X' P_Z X - \alpha X' X)^{-1} (X' P_Z Y - \alpha X' Y) \]
The limited information maximum likelihood (LIML) estimator is the k-class estimator \(\hat{\delta}(\alpha)\) where \[ \alpha = \lambda_{min} \Big( \big([X , Y]' [X , Y]\big)^{-1} [X , Y]' P_Z [X , Y] \Big) \]
If \(\alpha = 0\), then \(\hat{\delta}(\alpha) = \hat{\delta} _ {\text{2SLS}}\), while as \(\alpha \to \infty\), \(\hat{\delta}(\alpha) \to \hat{\delta} _ {\text{OLS}}\).
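A minimal sketch of the LIML estimator via the k-class formula above, applied to the simulated data (the smallest eigenvalue is computed as a generalized eigenvalue):

```julia
# LIML as a k-class estimator
W_mat = [X y];
α_liml = minimum(eigvals(Symmetric(W_mat'*Pz*W_mat), Symmetric(W_mat'*W_mat)));
β_LIML = (X'*Pz*X - α_liml*(X'*X)) \ (X'*Pz*y - α_liml*(X'*y))
```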
Asymptotically, the LIML estimator has the same distribution as 2SLS. However, the two can behave quite differently in finite samples. There is considerable evidence that the LIML estimator has superior finite-sample performance to 2SLS when there are many instruments or the reduced form is weak. On the other hand, since the LIML estimator is derived under normality, there is concern that it may not be robust in non-normal settings.
The Jackknife IV (JIVE) procedure addresses this overfitting by replacing each first-stage fitted value \(\hat{x}_i\) with a leave-one-out prediction computed without observation \(i\); a sketch follows below.
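A minimal sketch of the leave-one-out construction, under the assumption that the intended procedure is the standard JIVE1-style estimator, using the simulated data above:

```julia
# Jackknife IV: leave-one-out first-stage predictions
X_jive = similar(X);
for i in 1:n
    idx = setdiff(1:n, i)                                # all observations except i
    γ_i = (Z[idx,:]'*Z[idx,:]) \ (Z[idx,:]'*X[idx,:])    # first stage without observation i
    X_jive[i,:] = γ_i' * Z[i,:]                          # prediction for observation i
end
β_JIVE = (X_jive'*X) \ (X_jive'*y)
```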
Here we consider testing the validity of OLS. OLS is generally preferred to IV in terms of precision. Many researchers only doubt the (joint) validity of the regressors \(x_i\), rather than being certain that they are invalid (in the sense of not being predetermined). They then wish to choose between OLS and 2SLS, assuming that they have an instrument vector \(z_i\) whose validity is not in question. Further, assume for simplicity that \(L = K\), so that the efficient GMM estimator is the IV estimator.
The Hausman test statistic \[ H \equiv n (\hat{\delta} _ {IV} - \hat{\delta} _ {OLS})' [\hat{Avar} (\hat{\delta} _ {IV} - \hat{\delta} _ {OLS})]^{-1} (\hat{\delta} _ {IV} - \hat{\delta} _ {OLS}) \] is asymptotically distributed as a \(\chi^2_{K-s}\) under the null, where \(s = \dim(x_i \cap z_i)\) is the number of regressors that are also retained as instruments in \(z_i\).
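A minimal sketch of the statistic on the simulated data, with homoskedastic variance estimates and a pseudo-inverse in case the variance difference is singular (here no regressor is retained as an instrument, so \(s = 0\) and the degrees of freedom equal \(K\)):

```julia
# Hausman test comparing OLS and the just-identified IV estimate
Δδ = β_IV - β_OLS;
V_OLS = var(y - X*β_OLS) * inv(X'*X);     # homoskedastic OLS variance
# V_NHC_IV (computed above) and V_OLS already include the 1/n factor, so no extra n is needed
H = Δδ' * pinv(V_NHC_IV - V_OLS) * Δδ;
p_value_H = 1 - cdf(Chisq(k), H)
```

Since the simulated regressors are endogenous by construction, we would expect a small p-value here.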
In general, the idea of the Hausman test is the following. If you have two estimators, one which is efficient under \(H_0\) but inconsistent under \(H_1\) (in this case, OLS), and another which is consistent under \(H_1\) (in this case, IV), then construct a test as a quadratic form in the differences of the estimators. Another classic example arises in panel data with the hypothesis \(H_0\) of unconditional strict exogeneity. In that case, under \(H_0\) Random Effects estimators are efficient but under \(H_1\) they are inconsistent. Fixed Effects estimators instead are consistent under \(H_1\).
The Hausman test statistic can be used as a pretest procedure: select either OLS or IV according to the outcome of the test. Although widely used, this pretest procedure is not advisable. When the null is false, it is still possible that the test fails to reject it (committing a Type II error). In particular, this can happen with high probability when the sample size is small and/or when the regressors \(x_i\) are almost valid. In such an instance, estimation and inference will be based on an incorrect method. Therefore, the overall properties of the Hausman pretest procedure are undesirable.
The Hausman test is an example of a specification test. There are many other specification tests. One could for example test for conditional homoskedasticity. Unlike for the OLS case, there does not exist a convenient test for conditional homoskedasticity for the GMM case. A test statistic that is asymptotically chi-squared under the null is available but is extremely cumbersome; see White (1982, note 2). If in doubt, it is better to use the more generally valid inference methods that allow for conditional heteroskedasticity. Similarly, there does not exist a convenient test for serial correlation for the GMM case. If in doubt, it is better to use the more generally valid inference methods that allow for serial correlation; for example, when data are collected over time (that is, time-series data).