2021-10-29
A statistical model for regression data is a Gauss Markov Model if each of its distributions satisfies the following conditions.
Linearity: a statistical model \(\mathcal{F}\) over data \(\mathcal{D}\) satisfies linearity if, for each element of \(\mathcal{F}\), the data can be decomposed as \[ \begin{aligned} y_i &= \beta_1 x _ {i1} + \dots + \beta_k x _ {ik} + \varepsilon_i = x_i'\beta + \varepsilon_i \newline \underset{n \times 1}{y} &= \underset{n \times k}{X} \cdot \underset{k \times 1}{\beta} + \underset{n \times 1}{\varepsilon} \end{aligned} \]
Strict Exogeneity: \(\mathbb E [\varepsilon_i|x_1, \dots, x_n] = 0, \forall i\).
No Multicollinearity: \(\mathbb E_n [x_i x_i']\) is strictly positive definite almost surely. Equivalently, \(rank(X)=k\) with probability approaching 1. Intuition: no regressor is a linear combination of the other regressors (a rank check is sketched right after this list of assumptions).
Spherical Error Variance:
- \(\mathbb E[\varepsilon_i^2 | x] = \sigma^2 > 0, \ \forall i\)
- \(\mathbb E [\varepsilon_i \varepsilon_j |x ] = 0, \ \forall \ 1 \leq i < j \leq n\)
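As a minimal sketch of what a violation of no multicollinearity looks like (the design matrix below is made up for illustration, and only the LinearAlgebra standard library is assumed), a design whose second column is an exact multiple of the first has deficient rank:

# Hypothetical illustration of a failure of no multicollinearity
using LinearAlgebra

x1 = [1.0, 2.0, 3.0, 4.0];
X_collinear = hcat(x1, 2 .* x1);   # second regressor is an exact multiple of the first
rank(X_collinear)                  # = 1 < 2, so the assumption fails for this design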
The Extended Gauss Markov Model additionally satisfies the assumption
\[ Var(\varepsilon_i|x) = \mathbb E[\varepsilon_i^2|x]- \mathbb E[\varepsilon_i|x]^2 = \sigma^2 \quad \forall i, \qquad \text{i.e.} \quad Var(\varepsilon|x) = \sigma^2 I_n \]
A map \(\Pi: V \to V\) is a projection if \(\Pi \circ \Pi = \Pi\).
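As a quick numerical illustration (an added sketch; the matrix \(A\) below is arbitrary), the orthogonal projection onto the column space of a full-rank matrix is idempotent:

# Check that an orthogonal projection is idempotent (illustration only)
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 1.0 2.0];   # any n × k matrix with full column rank
Π = A * inv(A'*A) * A';            # orthogonal projection onto the column space of A
Π * Π ≈ Π                          # true: Π ∘ Π = Π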
The Gauss Markov Model assumes that the conditional expectation function (CEF) \(f(X) = \mathbb E[Y|X]\) and the linear projection \(g(X) = X \beta\) coincide.
This code draws 100 observations from the model \(y = 2 x_1 - x_2 + \varepsilon\) where \(x_1, x_2 \sim U[0,1]\) and \(\varepsilon \sim N(0,1)\).
# Load the required packages
using Random, Distributions, LinearAlgebra, Statistics

# Set seed
Random.seed!(123);

# Set the number of observations
n = 100;

# Set the dimension of X
k = 2;

# Draw a sample of explanatory variables
X = rand(Uniform(0,1), n, k);

# Draw the error term
σ = 1;
ε = rand(Normal(0,1), n, 1) * sqrt(σ);

# Set the parameters
β = [2; -1];

# Calculate the dependent variable
y = X*β + ε;
The OLS objective function is the average of the squared residuals, \[ Q_n (\beta) \equiv \frac{1}{n} \sum _ {i=1}^n \left( y_i - x_i'\beta \right)^2 = \frac{1}{n} (y - X\beta)' (y - X \beta) \]
Consider a dataset \(\mathcal{D}\) and define \(Q_n(\beta) = \mathbb E_n[(y_i - x_i'\beta )^2 ]\). Then the ordinary least squares (OLS) estimator \(\hat \beta _ {OLS}\) is the value of \(\beta\) that minimizes \(Q_n(\beta)\).
When we write the data \(\mathcal{D} = (y, X)\) in matrix form, \[ \hat \beta _ {OLS} = \arg \min_\beta \frac{1}{n} (y - X \beta)' (y - X\beta) \]
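To make the definition concrete, here is a brute-force sketch (added for illustration; the grid bounds and step size are arbitrary) that evaluates \(Q_n(\beta)\) on a grid and picks the minimizer, reusing `X`, `y`, and `n` from the simulation above. The result should be close to the closed-form estimate derived below.

# Brute-force minimization of Q_n(β) over a coarse grid (illustration only)
Q(b) = sum((y .- X*b).^2) / n;                      # objective Q_n(β)
grid = -3:0.05:3;                                   # candidate values for each coefficient
candidates = [[b1, b2] for b1 in grid, b2 in grid];
β_grid = candidates[argmin(Q.(candidates))]         # grid point closest to the OLS minimizer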
Theorem
Under the assumption that \(X\) has full rank, the OLS estimator is unique and it is determined by the normal equations. More explicitly, \(\hat \beta\) is the OLS estimate precisely when \(X'X \hat \beta = X'y\).
Proof
Taking the FOC: \[ \frac{\partial Q_n (\beta)}{\partial \beta} = -\frac{2}{n} X' y + \frac{2}{n} X'X\beta = 0 \quad \Leftrightarrow \quad X'X \beta = X'y \] Since \((X'X)^{-1}\) exists by assumption, we can solve for the unique minimizer \[ \hat \beta _ {OLS} = (X'X)^{-1} X'y \]
Finally, \(\frac{\partial^2 Q_n (\beta)}{\partial \beta \partial \beta'} = 2X'X/n\) is positive definite since \(X'X\) is positive semi-definite and \((X'X)^{-1}\) exists because \(X\) is full rank. Therefore, \(Q_n(\beta)\) is minimized at \(\hat \beta _ {OLS}\). \[\tag*{$\blacksquare$}\]
The \(k\) equations \(X'X \hat \beta = X'y\) are called normal equations.
The projection matrix is given by \(P = X(X'X)^{-1} X'\). It has the following properties:
- \(PX = X\)
- \(P \hat \varepsilon = 0 \quad\) (\(P\) and \(\hat \varepsilon\) are orthogonal)
- \(P y = X(X'X)^{-1} X'y = X\hat \beta = \hat y\)
- Symmetric: \(P=P'\); idempotent: \(PP = P\)
- \(tr(P) = tr( X(X'X)^{-1} X') = tr( X'X(X'X)^{-1}) = tr(I_k) = k\)
- Its diagonal elements \(h_{ii} = x_i'(X'X)^{-1} x_i\) are called leverage values.
\(h _ {ii} \in [0,1]\) is a normalized length of the observed regressor vector \(x_i\). In the OLS regression framework it captures the relative influence of observation \(i\) on the estimated coefficients. Note that \(\sum _ {i=1}^n h_{ii} = k\).
The annihilator matrix is given by \(M = I_n - P\). It has the following properties:
- \(MX = 0 \quad\) (\(M\) and \(X\) are orthogonal)
- \(M \hat \varepsilon = \hat \varepsilon\)
- \(M y = \hat \varepsilon\)
- Symmetric: \(M=M'\); idempotent: \(MM = M\)
- \(tr(M) = n - k\)
- Its diagonal elements are \(1 - h_{ii} \in [0,1]\)
Then we can equivalently write \(\hat y\) (defined by stacking \(\hat y_i\) into a vector) as \(\hat y = Py\).
# Estimate beta
β_hat = inv(X'*X)*(X'*y)
## 2×1 Array{Float64,2}:
##   1.8821600407711814
##  -0.9429354944506099
# Equivalent but faster formulation
β_hat = (X'*X)\(X'*y)
## 2×1 Array{Float64,2}:
##   1.8821600407711816
##  -0.9429354944506098
# Even faster (but less intuitive) formulation
β_hat = X\y
## 2×1 Array{Float64,2}:
##   1.8821600407711807
##  -0.9429354944506088
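As an added check of the normal equations, the estimate just computed should satisfy \(X'X \hat \beta = X'y\) up to floating-point error:

# Verify the normal equations
X'*X*β_hat ≈ X'*y              # true up to numerical precision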
Generally it’s not true that \[ \hat \beta_{OLS} = Var(X)^{-1} Cov(X,y) \]
# Wrong formulation
β_wrong = inv(cov(X)) * cov(X, y)
## 2×1 Array{Float64,2}:
##   1.8490257777704475
##  -0.9709213554007003
But it’s true if you include a constant, \(\alpha\) \[ y = \alpha + X \beta + \varepsilon \]
# Correct, with constant
α = 3;
y1 = α .+ X*β + ε;
β_hat1 = [ones(n,1) X] \ y1
## 3×1 Array{Float64,2}:
##   3.0362313477745615
##   1.8490257777704477
##  -0.9709213554007007
β_correct1 = inv(cov(X)) * cov(X, y1)
## 2×1 Array{Float64,2}:
##   1.8490257777704477
##  -0.9709213554007006
# Predicted y
y_hat = X*β_hat;

# Residuals
ε_hat = y - X*β_hat;

# Projection matrix
P = X * inv(X'*X) * X';

# Annihilator matrix
M = I - P;

# Leverage
h = diag(P);
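The properties listed above can be verified numerically; the following checks are added for illustration and reuse the objects just computed.

# Numerical checks of the projection and annihilator properties
P*X ≈ X                          # PX = X
maximum(abs.(M*X)) < 1e-10       # MX ≈ 0
P*y ≈ y_hat                      # Py = ŷ
M*y ≈ ε_hat                      # My = ε̂
P*P ≈ P && M*M ≈ M               # idempotency
tr(P) ≈ k && tr(M) ≈ n - k       # traces
sum(h) ≈ k                       # leverage values sum to k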
The error is homoskedastic if \(\mathbb E [\varepsilon^2 | x] = \sigma^2\) does not depend on \(x\). \[ Var(\varepsilon) = I \sigma^2 = \begin{bmatrix} \sigma^2 & \dots & 0 \newline \vdots & \ddots & \vdots \newline 0 & \dots & \sigma^2 \end{bmatrix} \]
The error is heteroskedastic if \(\mathbb E [\varepsilon_i^2 | x] = \sigma_i^2(x)\) does depend on \(x\). \[ Var(\varepsilon) = \begin{bmatrix} \sigma_1^2 & \dots & 0 \newline \vdots & \ddots & \vdots \newline 0 & \dots & \sigma_n^2 \end{bmatrix} \]
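For concreteness, here is an added sketch of a heteroskedastic design (the specific form of \(\sigma_i(x)\) is made up): the conditional standard deviation grows with the first regressor, reusing `X`, `n`, and `β` from above.

# Hypothetical heteroskedastic errors: standard deviation increases with x₁
σ_i = 0.5 .+ 2 .* X[:,1];              # σ_i(x) = 0.5 + 2 x_i1 (arbitrary choice)
ε_het = σ_i .* rand(Normal(0,1), n);   # ε_i | x has variance σ_i²(x)
y_het = X*β + ε_het;                   # heteroskedastic regression data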
The OLS residual variance can be an object of interest even in a heteroskedastic regression. Its method of moments estimator is given by \[ \hat \sigma^2 = \frac{1}{n} \sum _ {i=1}^n \hat \varepsilon_i^2 \]
Note that, since \(\hat \varepsilon = M \varepsilon\) and \(M'M = M\), \(\hat \sigma^2\) can be rewritten as \[ \hat \sigma^2 = \frac{1}{n} \varepsilon' M' M \varepsilon = \frac{1}{n} tr(\varepsilon' M \varepsilon) = \frac{1}{n} tr(M \varepsilon \varepsilon') \]
However, the method of moments estimator is a biased estimator. In fact \[ \mathbb E[\hat \sigma^2 | X] = \frac{1}{n} \mathbb E [ tr(M \varepsilon \varepsilon') | X] = \frac{1}{n} tr( M\mathbb E[\varepsilon \varepsilon' |X]) = \frac{1}{n} \sum _ {i=1}^n (1-h_{ii}) \sigma^2_i \]
Under conditional homoskedasticity, the above expression simplifies to \[ \mathbb E[\hat \sigma^2 | X] = \frac{1}{n} tr(M) \sigma^2 = \frac{n-k}{n} \sigma^2 \]
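A small Monte Carlo experiment (added here as a sketch; the number of replications is arbitrary) makes this bias visible: holding `X` fixed and redrawing homoskedastic errors, the average of \(\hat \sigma^2\) should be close to \(\frac{n-k}{n}\sigma^2\) rather than \(\sigma^2\).

# Monte Carlo check of the bias of the method of moments estimator
n_sim = 10_000;                               # number of replications (arbitrary)
σ2_hats = zeros(n_sim);
for s in 1:n_sim
    ε_s = rand(Normal(0,1), n) * sqrt(σ);     # new homoskedastic errors, X held fixed
    y_s = X*β + ε_s;
    ε_hat_s = y_s - X * ((X'*X)\(X'*y_s));    # OLS residuals for this draw
    σ2_hats[s] = sum(ε_hat_s.^2) / n;         # method of moments estimator
end
mean(σ2_hats)                                 # ≈ (n-k)/n * σ² = 0.98, below the true σ² = 1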
The OLS residual sample variance is denoted by \(s^2\) and is given by \[ s^2 = \frac{SSR}{n-k} = \frac{\hat \varepsilon'\hat \varepsilon}{n-k} = \frac{1}{n-k}\sum _ {i=1}^n \hat \varepsilon_i^2 \] Furthermore, the square root of \(s^2\), denoted \(s\), is called the standard error of the regression (SER) or the standard error of the equation (SEE). Not to be confused with other notions of standard error to be defined later in the course.
The sum of squared residuals can be rewritten as: \(SSR = \hat \varepsilon' \hat \varepsilon = \varepsilon' M \varepsilon\).
Under conditional homoskedasticity, the OLS residual sample variance \(s^2\) is an unbiased estimator of the error variance \(\sigma^2\).
Another unbiased estimator of \(\sigma^2\) is given by \[ \bar \sigma^2 = \frac{1}{n} \sum _ {i=1}^n (1-h_{ii})^{-1} \hat \varepsilon_i^2 \]
One measure of the variability of the dependent variable \(y_i\) is the sum of squares \(\sum _ {i=1}^n y_i^2 = y'y\). There is a decomposition: \[ \begin{aligned} y'y &= (\hat y + \hat \varepsilon)' (\hat y + \hat \varepsilon) \newline &= \hat y' \hat y + 2 \hat y' \hat \varepsilon + \hat \varepsilon' \hat \varepsilon \newline &= \hat y' \hat y + 2 \hat \beta'X'\hat \varepsilon + \hat \varepsilon' \hat \varepsilon \ \ (\text{since} \ \hat y = X\hat \beta) \newline &= \hat y' \hat y + \hat \varepsilon'\hat \varepsilon \ \ (\text{since} \ X'\hat \varepsilon =0) \end{aligned} \]
The uncentered \(\mathbf{R^2}\) is defined as: \[ R^2 _ {uc} \equiv 1 - \frac{\hat \varepsilon'\hat \varepsilon}{y'y} = 1 - \frac{\mathbb E_n[\hat \varepsilon_i^2]}{\mathbb E_n[y_i^2]} = \frac{ \mathbb E_n [\hat y_i^2]}{ \mathbb E_n [y_i^2]} \]
A more natural measure of variability is the sum of centered squares \(\sum _ {i=1}^n (y_i - \bar y)^2,\) where \(\bar y := \frac{1}{n}\sum _ {i=1}^n y_i\). If the regressors include a constant, it can be decomposed as \[ \sum _ {i=1}^n (y_i - \bar y)^2 = \sum _ {i=1}^n (\hat y_i - \bar y)^2 + \sum _ {i=1}^n \hat \varepsilon_i^2 \]
The coefficient of determination, \(\mathbf{R^2}\), is defined as \[ R^2 \equiv 1 - \frac{\sum _ {i=1}^n \hat \varepsilon_i^2}{\sum _ {i=1}^n (y_i - \bar y)^2 }= \frac{ \sum _ {i=1}^n (\hat y_i - \bar y)^2 } { \sum _ {i=1}^n (y_i - \bar y)^2} = \frac{\mathbb E_n[(\hat y_i - \bar y)^2]}{\mathbb E_n[(y_i - \bar y)^2]} \]
Always use the centered \(R^2\) unless you really know what you are doing.
# Biased variance estimator
σ_hat = ε_hat'*ε_hat / n;

# Unbiased estimator 1
σ_hat_2 = ε_hat'*ε_hat / (n-k);

# Unbiased estimator 2
σ_hat_3 = mean( ε_hat.^2 ./ (1 .- h) );
# R squared - uncentered
R2_uc = (y_hat'*y_hat) / (y'*y);

# R squared
y_bar = mean(y);
R2 = ((y_hat .- y_bar)'*(y_hat .- y_bar)) / ((y .- y_bar)'*(y .- y_bar));
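The centered decomposition requires a constant among the regressors. The added check below verifies it for the regression of `y1` on a constant and `X` estimated earlier; without the constant the identity generally fails, which is one reason to be careful with the centered \(R^2\).

# Check that TSS = ESS + SSR when a constant is included
X1 = [ones(n,1) X];                      # regressors including the constant
y1_hat = X1 * β_hat1;                    # fitted values
ε1_hat = y1 - y1_hat;                    # residuals
TSS = sum((y1 .- mean(y1)).^2);          # total (centered) sum of squares
ESS = sum((y1_hat .- mean(y1)).^2);      # explained sum of squares
SSR = sum(ε1_hat.^2);                    # sum of squared residuals
TSS ≈ ESS + SSR                          # true: the decomposition holds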
Theorem
Under the GM assumptions (1)-(3), the OLS estimator is conditionally unbiased, i.e. the distribution of \(\hat \beta _ {OLS}\) is centered at the true parameter \(\beta\): \(\mathbb E [\hat \beta | X] = \beta\).
Proof \[ \begin{aligned} \mathbb E [\hat \beta | X] &= \mathbb E [ (X'X)^{-1} X'y | X] \newline &= (X'X)^{-1} X' \mathbb E [y | X] \newline &= (X'X)^{-1} X' \mathbb E [X \beta + \varepsilon | X] \newline &= (X'X)^{-1} X'X \beta + (X'X)^{-1} X' \mathbb E [\varepsilon | X] \newline &= \beta \end{aligned} \] \[\tag*{$\blacksquare$}\]
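As an added sanity check (the replication count is arbitrary), the Monte Carlo below holds `X` fixed, redraws the errors, and stores the OLS estimates; their average across replications should be close to the true \(\beta = (2, -1)\).

# Monte Carlo check of conditional unbiasedness (X held fixed)
n_sim = 10_000;                           # number of replications (arbitrary)
β_draws = zeros(n_sim, k);
for s in 1:n_sim
    y_s = X*β + rand(Normal(0,1), n) * sqrt(σ);
    β_draws[s, :] = (X'*X)\(X'*y_s);      # OLS estimate for this draw
end
mean(β_draws, dims=1)                     # ≈ [2 -1]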
Theorem
Under the GM assumptions (1)-(4), \(Var(\hat \beta |X) = \sigma^2 (X'X)^{-1}\).
Proof: \[ \begin{aligned} Var(\hat \beta |X) &= Var( (X'X)^{-1} X'y|X) \newline &= ((X'X)^{-1} X' ) Var(y|X) ((X'X)^{-1} X' )' \newline &= ((X'X)^{-1} X' ) Var(X\beta + \varepsilon|X) ((X'X)^{-1} X' )' \newline &= ((X'X)^{-1} X' ) Var(\varepsilon|X) ((X'X)^{-1} X' )' \newline &= ((X'X)^{-1} X' ) \sigma^2 I ((X'X)^{-1} X' )' \newline &= \sigma^2 (X'X)^{-1} \end{aligned} \] \[\tag*{$\blacksquare$}\]
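Reusing the `β_draws` from the Monte Carlo sketch above (an added illustration), their sample covariance across replications should be close to the theoretical \(\sigma^2 (X'X)^{-1}\), computed as `var_β` at the end of this section.

# Empirical covariance of the simulated OLS estimates
cov(β_draws)                              # ≈ σ² (X'X)⁻¹, i.e. var_β below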
Higher correlation among the regressors implies a higher variance of the OLS estimator.
Intuition: individual observations carry less information. You are exploring a smaller region of the \(X\) space.
Theorem
Under the GM assumptions (1)-(4), \(Cov (\hat \beta, \hat \varepsilon ) = 0\).
Theorem
Under the GM assumptions (1)-(4), \(\hat \beta _ {OLS}\) is the best (most efficient) linear unbiased estimator (BLUE), i.e., for any linear unbiased estimator \(b\): \(Var (b|X) \geq Var (\hat \beta |X)\).
Consider four steps:
1. Any linear estimator can be written as \(b = Ay\) for some \(k \times n\) matrix \(A = A(X)\). Decompose \(A = (X'X)^{-1}X' + C\).
2. Unbiasedness for every \(\beta\) requires \(\mathbb E[b|X] = AX\beta = \beta\), i.e. \(AX = I_k\), which is equivalent to \(CX = 0\).
3. Then \(Var(b|X) = A \, Var(\varepsilon|X) \, A' = \sigma^2 AA' = \sigma^2 \left[ (X'X)^{-1} + CC' \right]\), since the cross terms vanish: \(CX(X'X)^{-1} = 0\).
4. Therefore \(Var(b|X) - Var(\hat \beta|X) = \sigma^2 CC'\), which is positive semi-definite. \[\tag*{$\blacksquare$}\]
\(Var(b | X) \geq Var (\hat{\beta} | X)\) is meant in the matrix sense: \(Var(b|X) - Var(\hat \beta|X)\) is positive semi-definite.
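To make the comparison concrete, the added sketch below builds an alternative linear unbiased estimator, a weighted least squares estimator with arbitrary positive weights (the weight matrix is made up for illustration), and checks that the difference of the conditional variances is positive semi-definite.

# An alternative linear unbiased estimator: WLS with arbitrary positive weights
W = Diagonal(1 .+ rand(Uniform(0,1), n));   # arbitrary weights (illustration only)
A = inv(X'*W*X) * (X'*W);                   # b = A*y is linear and unbiased since A*X = I
var_b = σ * A * A';                         # Var(b | X) under homoskedasticity
var_ols = σ * inv(X'*X);                    # Var(β̂ | X)
eigvals(Symmetric(var_b - var_ols))         # all ≥ 0 (up to rounding): OLS is BLUE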
# Ideal variance of the OLS estimator
var_β = σ * inv(X'*X)
## 2×2 Array{Float64,2}:
##   0.0609402  -0.0467732
##  -0.0467732   0.0656808
# Standard errors
std_β = sqrt.(diag(var_β))
## 2-element Array{Float64,1}:
##  0.24686077212177054
##  0.25628257446345265