OLS Algebra

Last updated on Oct 29, 2021

The Gauss Markov Model

Definition

A statistical model for regression data is a Gauss Markov Model if each of its distributions satisfies the following conditions:

  1. Linearity: a statistical model $F$ over data $D$ satisfies linearity if for each element of $F$ the data can be decomposed as $y_i = \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i = x_i' \beta + \varepsilon_i$, or in matrix notation $\underset{n \times 1}{y} = \underset{n \times k}{X} \, \underset{k \times 1}{\beta} + \underset{n \times 1}{\varepsilon}$.

  2. Strict Exogeneity: $\mathbb{E}[\varepsilon_i | x_1, \dots, x_n] = 0, \ \forall i$.

  3. No Multicollinearity: $\mathbb{E}_n[x_i x_i']$ is strictly positive definite almost surely. Equivalently, $\text{rank}(X) = k$ with probability 1. Intuition: no regressor is a linear combination of the other regressors.

  4. Spherical Error Variance: $\mathbb{E}[\varepsilon_i^2 | X] = \sigma^2 > 0, \ \forall i$ and $\mathbb{E}[\varepsilon_i \varepsilon_j | X] = 0, \ \forall \, 1 \leq i < j \leq n$.

The Extended Gauss Markov Model additionally satisfies the assumption

  5. Normal error term: $\varepsilon | X \sim N(0, \sigma^2 I_n)$ and $\varepsilon \perp X$.

Implications

  • Note that by (2) and (4) you get homoskedasticity:

$$\text{Var}(\varepsilon_i | X) = \mathbb{E}[\varepsilon_i^2 | X] - \mathbb{E}[\varepsilon_i | X]^2 = \sigma^2 \quad \forall i$$

  • Strict exogeneity is not restrictive, since it is sufficient to include a constant in the regression to enforce the zero mean: $y_i = \alpha + x_i'\beta + (\varepsilon_i - \alpha)$. By the law of iterated expectations (LIE), $\mathbb{E}[\varepsilon_i] = \mathbb{E}_x[\mathbb{E}[\varepsilon_i | x]] = 0$.
  • Strict exogeneity also implies $\mathbb{E}[x_{jk} \varepsilon_i] = 0$ by the LIE.
  • These two conditions together imply $\text{Cov}(x_{jk}, \varepsilon_i) = 0$.

Projection

A map $\Pi: V \to V$ is a projection if $\Pi \circ \Pi = \Pi$.
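For concreteness, a minimal numerical example (the matrix Π below is just an illustration, not tied to the regression model): projecting a vector in $\mathbb{R}^2$ onto its first coordinate.

# A simple projection on R²: keep the first coordinate, drop the second
Π = [1.0 0.0; 0.0 0.0];

# Applying the map twice is the same as applying it once: Π ∘ Π = Π
Π * Π == Π      # true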

The Gauss Markov Model assumes that the conditional expectation function (CEF) $f(X) = \mathbb{E}[Y|X]$ and the linear projection $g(X) = X\beta$ coincide.

Code - DGP

This code draws 100 observations from the model $y = 2 x_1 - x_2 + \varepsilon$, where $x_1, x_2 \sim U[0,1]$ and $\varepsilon \sim N(0,1)$.

# Load packages (Distributions for the random draws; LinearAlgebra and Statistics are used in later blocks)
using Random, Distributions, LinearAlgebra, Statistics

# Set seed
Random.seed!(123);

# Set the number of observations
n = 100;

# Set the dimension of X
k = 2;

# Draw a sample of explanatory variables
X = rand(Uniform(0,1), n, k);

# Set the error variance and draw the error term
σ = 1;
ε = rand(Normal(0,1), n, 1) * sqrt(σ);

# Set the parameters
β = [2; -1];

# Calculate the dependent variable
y = X*β + ε;

The OLS estimator

Definition

The sum of squared residuals (SSR), divided by the sample size, is given by $$Q_n(\beta) \equiv \frac{1}{n} \sum_{i=1}^n (y_i - x_i'\beta)^2 = \frac{1}{n} (y - X\beta)'(y - X\beta)$$

Consider a dataset $D$ and define $Q_n(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$. Then the ordinary least squares (OLS) estimator $\hat\beta_{OLS}$ is the value of $\beta$ that minimizes $Q_n(\beta)$.

When we can write $D = (y, X)$ in matrix form, then $$\hat\beta_{OLS} = \arg\min_\beta \frac{1}{n} (y - X\beta)'(y - X\beta)$$
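To see the definition at work, here is a brute-force sketch (reusing X and y from the DGP block above; Qn, grid, and β_grid are illustrative names introduced here): evaluate $Q_n(\beta)$ on a coarse grid and compare the grid minimizer with the closed-form solution derived in the next subsection.

# Objective function Qn(β)
Qn(b) = sum(abs2, vec(y) - X*b) / n;

# Coarse grid of candidate coefficient vectors
grid = [[b1; b2] for b1 in 0:0.01:4, b2 in -3:0.01:1];

# Pick the grid point with the smallest Qn
β_grid = grid[argmin(Qn.(grid))];

# Compare with the closed-form OLS solution
[β_grid (X'*X)\(X'*y)]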

Derivation

Theorem

Under the assumption that $X$ has full rank, the OLS estimator is unique and it is determined by the normal equations. More explicitly, $\hat\beta$ is the OLS estimate precisely when $X'X\hat\beta = X'y$.

Proof

Taking the first order condition (FOC): $$\frac{\partial Q_n(\beta)}{\partial \beta} = -\frac{2}{n} X'y + \frac{2}{n} X'X\beta = 0 \quad \Longrightarrow \quad X'X\beta = X'y$$ Since $(X'X)^{-1}$ exists by assumption, $\hat\beta = (X'X)^{-1}X'y$.

Finally, $\frac{\partial^2 Q_n(\beta)}{\partial \beta \partial \beta'} = \frac{2}{n} X'X$ is positive definite since $X'X$ is positive semi-definite and $(X'X)^{-1}$ exists because $X$ has full rank. Therefore, $Q_n(\beta)$ is minimized at $\hat\beta$.

The $k$ equations $X'X\hat\beta = X'y$ are called the normal equations.

Further Objects

  • Fitted coefficient: $\hat\beta_{OLS} = (X'X)^{-1}X'y = \mathbb{E}_n[x_i x_i']^{-1} \mathbb{E}_n[x_i y_i]$
  • Fitted residual: $\hat\varepsilon_i = y_i - x_i'\hat\beta$
  • Fitted value: $\hat y_i = x_i'\hat\beta$
  • Predicted coefficient (leave-one-out): $\hat\beta_{-i} = \left( \sum_{j \neq i} x_j x_j' \right)^{-1} \sum_{j \neq i} x_j y_j$
  • Prediction error: $\tilde\varepsilon_i = y_i - x_i'\hat\beta_{-i}$
  • Predicted value: $\tilde y_i = x_i'\hat\beta_{-i}$

Notes on Orthogonality Conditions

  • The normal equations are equivalent to the moment condition $\mathbb{E}_n[x_i \hat\varepsilon_i] = 0$.
  • The algebraic result $\mathbb{E}_n[x_i \hat\varepsilon_i] = 0$ is called the orthogonality property of the OLS residual $\hat\varepsilon_i$ (see the check below).
  • If we have included a constant in the regression, $\mathbb{E}_n[\hat\varepsilon_i] = 0$.
  • $\mathbb{E}[\mathbb{E}_n[x_i \varepsilon_i]] = 0$ by strict exogeneity (assumed in GM), but $\mathbb{E}_n[x_i \varepsilon_i] \neq \mathbb{E}[x_i \varepsilon_i] = 0$ in general. This is why $\hat\beta_{OLS}$ is just an estimate of $\beta_0$.
  • Calculating OLS is like replacing the $k$ equations $\mathbb{E}[x_{ij} \varepsilon_i] = 0 \ \forall j$ with $\mathbb{E}_n[x_{ij} \hat\varepsilon_i] = 0 \ \forall j$ and forcing them to hold (reminiscent of GMM).
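A quick numerical check of these orthogonality conditions, reusing X and y from the DGP block above (β_hat and ε_hat are recomputed here; they reappear in the code blocks below):

# OLS fit and residuals
β_hat = (X'*X)\(X'*y);
ε_hat = y - X*β_hat;

# Orthogonality: X'ε̂ is numerically zero
maximum(abs.(X'*ε_hat))

# Without a constant among the regressors, the residuals need not average to zero
mean(ε_hat)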

The Projection Matrix

The projection matrix is given by $P = X(X'X)^{-1}X'$. It has the following properties:

  • $PX = X$
  • $P\hat\varepsilon = 0$ ($P$ and $\hat\varepsilon$ orthogonal)
  • $Py = X(X'X)^{-1}X'y = X\hat\beta = \hat y$
  • Symmetric: $P' = P$; idempotent: $PP = P$
  • $\text{tr}(P) = \text{tr}(X(X'X)^{-1}X') = \text{tr}(X'X(X'X)^{-1}) = \text{tr}(I_k) = k$
  • Its diagonal elements $h_{ii} = x_i'(X'X)^{-1}x_i$ are called leverage.

$h_{ii} \in [0,1]$ is a normalized length of the observed regressor vector $x_i$. In the OLS regression framework it captures the relative influence of observation $i$ on the estimated coefficient. Note that $\sum_{i=1}^n h_{ii} = k$.

The Annihilator Matrix

The annihilator matrix is given by $M = I_n - P$. It has the following properties:

  • $MX = 0$ ($M$ and $X$ orthogonal)
  • $M\hat\varepsilon = \hat\varepsilon$
  • $My = \hat\varepsilon$
  • Symmetric: $M' = M$; idempotent: $MM = M$
  • $\text{tr}(M) = n - k$
  • Its diagonal elements are $1 - h_{ii} \in [0,1]$

Then we can equivalently write $\hat y$ (defined by stacking the $\hat y_i$ into a vector) as $\hat y = Py$.

Estimating Beta

# Estimate beta
β_hat = inv(X'*X)*(X'*y)
## 2×1 Array{Float64,2}:
##   1.8821600407711814
##  -0.9429354944506099
# Equivalent but faster formulation
β_hat = (X'*X)\(X'*y)
## 2×1 Array{Float64,2}:
##   1.8821600407711816
##  -0.9429354944506098
# Even faster (but less intuitive) formulation
β_hat = X\y
## 2×1 Array{Float64,2}:
##   1.8821600407711807
##  -0.9429354944506088

Equivalent Formulation?

In general it is not true that $\hat\beta_{OLS} = \text{Var}(X)^{-1}\text{Cov}(X, y)$

# Wrong formulation
β_wrong = inv(cov(X)) * cov(X, y)
## 2×1 Array{Float64,2}:
##   1.8490257777704475
##  -0.9709213554007003

Equivalent Formulation (correct)

But it is true if you include a constant $\alpha$: $y = \alpha + X\beta + \varepsilon$

# Correct, with constant
α = 3;
y1 = α .+ X*β + ε;
β_hat1 = [ones(n,1) X] \ y1
## 3×1 Array{Float64,2}:
##   3.0362313477745615
##   1.8490257777704477
##  -0.9709213554007007
β_correct1 = inv(cov(X)) * cov(X, y1)
## 2×1 Array{Float64,2}:
##   1.8490257777704477
##  -0.9709213554007006

Some More Objects

# Predicted y
y_hat = X*β_hat;

# Residuals
ε_hat = y - X*β_hat;

# Projection matrix
P = X * inv(X'*X) * X';

# Annihilator matrix
M = I - P;

# Leverage
h = diag(P);
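A few sanity checks of the properties listed above, using the objects just computed:

# P reproduces X and the fitted values; M annihilates X
maximum(abs.(P*X - X))
maximum(abs.(M*X))
maximum(abs.(P*y - y_hat))

# Traces: tr(P) = k and tr(M) = n - k
tr(P), tr(M)

# Leverages lie in [0, 1] and sum to k
extrema(h), sum(h)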

OLS Residuals

Homoskedasticity

The error is homoskedastic if $\mathbb{E}[\varepsilon_i^2 | x_i] = \sigma^2$ does not depend on $x_i$: $$\text{Var}(\varepsilon | X) = I\sigma^2 = \begin{bmatrix} \sigma^2 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma^2 \end{bmatrix}$$

The error is heteroskedastic if $\mathbb{E}[\varepsilon_i^2 | x_i] = \sigma_i^2(x_i)$ does depend on $x_i$: $$\text{Var}(\varepsilon | X) = \begin{bmatrix} \sigma_1^2 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_n^2 \end{bmatrix}$$

Residual Variance

The OLS residual variance can be an object of interest even in a heteroskedastic regression. Its method of moments estimator is given by $$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat\varepsilon_i^2$$

Note that $\hat\sigma^2$ can be rewritten as $$\hat\sigma^2 = \frac{1}{n} \varepsilon'M'M\varepsilon = \frac{1}{n} \varepsilon'M\varepsilon = \frac{1}{n} \text{tr}(\varepsilon'M\varepsilon) = \frac{1}{n} \text{tr}(M\varepsilon\varepsilon')$$ since $\hat\varepsilon = M\varepsilon$ and $M$ is symmetric and idempotent.

However, the method of moments estimator is a biased estimator. In fact, $$\mathbb{E}[\hat\sigma^2 | X] = \frac{1}{n} \mathbb{E}[\text{tr}(M\varepsilon\varepsilon') | X] = \frac{1}{n} \text{tr}(M \, \mathbb{E}[\varepsilon\varepsilon' | X]) = \frac{1}{n} \sum_{i=1}^n (1 - h_{ii}) \sigma_i^2$$

Under conditional homoskedasticity, the above expression simplifies to $$\mathbb{E}[\hat\sigma^2 | X] = \frac{1}{n} \text{tr}(M) \sigma^2 = \frac{n-k}{n} \sigma^2$$

Sample Variance

The OLS residual sample variance is denoted by $s^2$ and is given by $$s^2 = \frac{SSR}{n-k} = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} = \frac{1}{n-k} \sum_{i=1}^n \hat\varepsilon_i^2$$ Furthermore, the square root of $s^2$, denoted $s$, is called the standard error of the regression (SER) or the standard error of the equation (SEE). It is not to be confused with other notions of standard error to be defined later in the course.

The sum of squared residuals can be rewritten as $SSR = \hat\varepsilon'\hat\varepsilon = \varepsilon'M\varepsilon$.

Under conditional homoskedasticity, the OLS residual sample variance $s^2$ is an unbiased estimator of the error variance $\sigma^2$.

Another unbiased estimator of $\sigma^2$ is given by $$\bar\sigma^2 = \frac{1}{n} \sum_{i=1}^n (1 - h_{ii})^{-1} \hat\varepsilon_i^2$$

Uncentered R^2

One measure of the variability of the dependent variable $y_i$ is the sum of squares $\sum_{i=1}^n y_i^2 = y'y$. There is a decomposition: $$y'y = (\hat y + \hat\varepsilon)'(\hat y + \hat\varepsilon) = \hat y'\hat y + 2\hat y'\hat\varepsilon + \hat\varepsilon'\hat\varepsilon = \hat y'\hat y + \hat\varepsilon'\hat\varepsilon$$ since $\hat y'\hat\varepsilon = \hat\beta'X'\hat\varepsilon = 0$ (because $\hat y = X\hat\beta$ and $X'\hat\varepsilon = 0$).

The uncentered $R^2$ is defined as $$R^2_{uc} \equiv 1 - \frac{\hat\varepsilon'\hat\varepsilon}{y'y} = 1 - \frac{\mathbb{E}_n[\hat\varepsilon_i^2]}{\mathbb{E}_n[y_i^2]} = \frac{\mathbb{E}_n[\hat y_i^2]}{\mathbb{E}_n[y_i^2]}$$

Centered R^2

A more natural measure of variability is the sum of centered squares $\sum_{i=1}^n (y_i - \bar y)^2$, where $\bar y := \frac{1}{n}\sum_{i=1}^n y_i$. If the regressors include a constant, it can be decomposed as $$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n \hat\varepsilon_i^2$$

The coefficient of determination, $R^2$, is defined as $$R^2 \equiv 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n (y_i - \bar y)^2} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2} = \frac{\mathbb{E}_n[(\hat y_i - \bar y)^2]}{\mathbb{E}_n[(y_i - \bar y)^2]}$$

Always use the centered $R^2$ unless you really know what you are doing.

Code - Variance

# Biased variance estimator
σ_hat = ε_hat'*ε_hat / n;

# Unbiased estimator 1
σ_hat_2 = ε_hat'*ε_hat / (n-k);

# Unbiased estimator 2
σ_hat_3 = mean( ε_hat.^2 ./ (1 .- h) );
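As a sketch of the bias results above, the following Monte Carlo holds X fixed, redraws the error term, and averages the three estimators across simulations (n_sim and the names σ2_mm, s2_sim, σ2_bar are arbitrary choices introduced for illustration). The first average should be close to σ²(n−k)/n, the other two close to σ².

# Monte Carlo: hold X fixed, redraw ε, recompute the three estimators of σ²
n_sim = 10_000;
σ2_mm, s2_sim, σ2_bar = zeros(n_sim), zeros(n_sim), zeros(n_sim);
for sim in 1:n_sim
    ε_sim = rand(Normal(0,1), n) * sqrt(σ);
    y_sim = X*β + ε_sim;
    e_sim = y_sim - X*((X'*X)\(X'*y_sim));
    σ2_mm[sim]  = sum(abs2, e_sim) / n;          # method of moments (biased)
    s2_sim[sim] = sum(abs2, e_sim) / (n - k);    # sample variance
    σ2_bar[sim] = mean(e_sim.^2 ./ (1 .- h));    # leverage-corrected
end

# Simulated means vs the theoretical values σ²(n-k)/n and σ²
mean(σ2_mm), σ*(n-k)/n, mean(s2_sim), mean(σ2_bar), σ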

Code - R^2

# R squared - uncentered
R2_uc = (y_hat'*y_hat)/ (y'*y);

# R squared
y_bar = mean(y);
R2 = ((y_hat .- y_bar)'*(y_hat .- y_bar))/ ((y .- y_bar)'*(y .- y_bar));
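As noted above, the decomposition behind the centered R² requires a constant among the regressors. A quick sketch using the constant-included regression from the earlier block (y1 and β_hat1; SSR1, TSS1, ESS1 are illustrative names introduced here) confirms that the two expressions for the centered R² then coincide:

# Fitted values and residuals from the regression with a constant
y_hat1 = [ones(n,1) X] * β_hat1;
ε_hat1 = y1 - y_hat1;

# Sums of squares
y_bar1 = mean(y1);
SSR1 = sum(abs2, ε_hat1);
TSS1 = sum(abs2, y1 .- y_bar1);
ESS1 = sum(abs2, y_hat1 .- y_bar1);

# With a constant included, 1 - SSR/TSS and ESS/TSS agree
1 - SSR1/TSS1, ESS1/TSS1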

Finite Sample Properties of OLS

Conditional Unbiasedness

Theorem

Under the GM assumptions (1)-(3), the OLS estimator is conditionally unbiased, i.e. the distribution of $\hat\beta_{OLS}$ is centered at $\beta_0$: $\mathbb{E}[\hat\beta | X] = \beta_0$.

Proof: $$\mathbb{E}[\hat\beta | X] = \mathbb{E}[(X'X)^{-1}X'y | X] = (X'X)^{-1}X'\mathbb{E}[y | X] = (X'X)^{-1}X'\mathbb{E}[X\beta + \varepsilon | X] = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\mathbb{E}[\varepsilon | X] = \beta$$

OLS Variance

Theorem

Under the GM assumptions (1)-(4), $\text{Var}(\hat\beta | X) = \sigma^2 (X'X)^{-1}$.

Proof: $$\text{Var}(\hat\beta | X) = \text{Var}((X'X)^{-1}X'y | X) = (X'X)^{-1}X' \, \text{Var}(y | X) \, X(X'X)^{-1} = (X'X)^{-1}X' \, \text{Var}(\varepsilon | X) \, X(X'X)^{-1} = (X'X)^{-1}X' \, \sigma^2 I \, X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$$

Higher correlation among the regressors implies higher variance of the OLS estimator.

Intuition: individual observations carry less information. You are exploring a smaller region of the X space.
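A Monte Carlo sketch of both finite sample results, holding X fixed and redrawing the error term (n_sim and β_sims are illustrative names introduced here): the average of the simulated β̂ should be close to β, and their sample covariance close to σ²(X'X)⁻¹.

# Monte Carlo: distribution of β̂ holding X fixed
n_sim = 10_000;
β_sims = zeros(n_sim, k);
for sim in 1:n_sim
    ε_sim = rand(Normal(0,1), n) * sqrt(σ);
    y_sim = X*β + ε_sim;
    β_sims[sim, :] = (X'*X)\(X'*y_sim);
end

# Simulated mean of β̂ vs true β
mean(β_sims, dims=1), β'

# Simulated covariance of β̂ vs σ²(X'X)⁻¹
cov(β_sims), σ * inv(X'*X)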

BLUE

Theorem

Under the GM assumptions (1)-(4), $\text{Cov}(\hat\beta, \hat\varepsilon \, | \, X) = 0$.

Theorem

Under the GM assumptions (1)-(4), $\hat\beta_{OLS}$ is the best (most efficient) linear unbiased estimator (BLUE), i.e., for any linear unbiased estimator $b$: $\text{Var}(b | X) \geq \text{Var}(\hat\beta | X)$.

BLUE Proof

Consider four steps:

  1. Define three objects: (i) $b = Cy$, (ii) $A = (X'X)^{-1}X'$ so that $\hat\beta = Ay$, and (iii) $D = C - A$.
  2. Decompose $b$ as $$b = (D + A)y = Dy + Ay = D(X\beta + \varepsilon) + \hat\beta = DX\beta + D\varepsilon + \hat\beta$$
  3. By assumption, $b$ must be unbiased: $$\mathbb{E}[b | X] = \mathbb{E}[D(X\beta + \varepsilon) + Ay | X] = \mathbb{E}[DX\beta | X] + \mathbb{E}[D\varepsilon | X] + \mathbb{E}[\hat\beta | X] = DX\beta + D\mathbb{E}[\varepsilon | X] + \beta = DX\beta + \beta$$ Hence, it must be that $DX = 0$.

BLUE Proof (2)

  4. We know by steps 2-3 that $b = D\varepsilon + \hat\beta$ (since $DX = 0$). We can now calculate its variance: $$\text{Var}(b | X) = \text{Var}(\hat\beta + D\varepsilon | X) = \text{Var}(Ay + D\varepsilon | X) = \text{Var}(AX\beta + (D + A)\varepsilon | X) = \text{Var}((D + A)\varepsilon | X) = (D + A)\,\sigma^2 I\,(D + A)' = \sigma^2 (DD' + AA' + DA' + AD') = \sigma^2 (DD' + AA') \geq \sigma^2 AA' = \sigma^2 (X'X)^{-1} = \text{Var}(\hat\beta | X)$$ since $DA' = AD' = 0$ (because $DX = 0$) and $AA' = (X'X)^{-1}$.

$\text{Var}(b | X) \geq \text{Var}(\hat\beta | X)$ is meant in the positive definite sense.
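A numerical sketch of the theorem: take an arbitrary alternative linear unbiased estimator $b = Cy$ with $C = (X'WX)^{-1}X'W$ for some positive definite weight matrix $W$ (W and C are introduced here purely for illustration). Since $CX = I_k$, $b$ is unbiased, and the difference between its conditional variance and that of OLS should be positive semi-definite.

# An arbitrary alternative linear unbiased estimator: b = C*y with C*X = I
W = Diagonal(1 .+ rand(n));      # arbitrary positive weights
C = inv(X'*W*X) * (X'*W);

# Conditional variances of the alternative estimator and of OLS
var_b   = σ * C * C';
var_ols = σ * inv(X'*X);

# The difference should be positive semi-definite: eigenvalues ≥ 0 (up to numerical error)
eigvals(Symmetric(var_b - var_ols))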

Code - Variance

# Conditional variance of the OLS estimator, using the true error variance σ
var_β = σ * inv(X'*X)
## 2×2 Array{Float64,2}:
##   0.0609402  -0.0467732
##  -0.0467732   0.0656808
# Standard errors
std_β = sqrt.(diag(var_β))
## 2-element Array{Float64,1}:
##  0.24686077212177054
##  0.25628257446345265
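In practice σ² is unknown. A feasible version of these standard errors (a sketch replacing σ² with the unbiased estimate s² = ε̂'ε̂/(n−k) from the residuals computed above) is:

# Replace the unknown σ² with the unbiased estimate s²
s2 = sum(abs2, ε_hat) / (n - k);

# Feasible (homoskedastic) variance and standard errors
var_β_feasible = s2 * inv(X'*X);
std_β_feasible = sqrt.(diag(var_β_feasible))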