OLS Inference

Last updated on Oct 29, 2021

Asymptotic Theory of the OLS Estimator

OLS Consistency

Theorem: Assume that $(x_i, y_i)_{i=1}^n$ are i.i.d., $E[x_i x_i'] = Q$ is positive definite, $E[\|x_i\|^2] < \infty$ and $E[y_i^2] < \infty$. Then $\hat{\beta}_{OLS}$ is a consistent estimator of $\beta_0$, i.e. $\hat{\beta} = E_n[x_i x_i']^{-1} E_n[x_i y_i] \overset{p}{\to} \beta_0$.

Proof:
We consider 4 steps:

  1. $E_n[x_i x_i'] \overset{p}{\to} E[x_i x_i']$ by the WLLN, since the $x_i x_i'$ are i.i.d. and $E[\|x_i\|^2] < \infty$.
  2. $E_n[x_i y_i] \overset{p}{\to} E[x_i y_i]$ by the WLLN, since the $x_i y_i$ are i.i.d. and, by Cauchy–Schwarz and the finite second moments of $x_i$ and $y_i$, $E[\|x_i y_i\|] \le \sqrt{E[\|x_i\|^2] E[y_i^2]} < \infty$.
  3. $E_n[x_i x_i']^{-1} \overset{p}{\to} E[x_i x_i']^{-1}$ by the CMT.
  4. $E_n[x_i x_i']^{-1} E_n[x_i y_i] \overset{p}{\to} E[x_i x_i']^{-1} E[x_i y_i] = \beta_0$ by the CMT.
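To see consistency in action, here is a minimal simulation sketch in Julia: it draws increasingly large samples from an assumed linear DGP (uniform regressors, standard normal errors, both chosen purely for illustration, not taken from the notes) and shows the estimation error shrinking.

using Random, Distributions, LinearAlgebra

Random.seed!(1);
β0 = [2.0; -1.0];                       # true coefficients (illustrative)
for n in (100, 10_000, 1_000_000)
    X = rand(Uniform(0, 1), n, 2)       # i.i.d. regressors
    y = X * β0 + rand(Normal(0, 1), n)  # linear model with N(0,1) errors
    β_hat = (X'*X) \ (X'*y)             # OLS estimate
    println("n = $n: error = ", norm(β_hat - β0))
end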

Variance and Assumptions

Now we are going to investigate the variance of $\hat{\beta}_{OLS}$, progressively relaxing the underlying assumptions.

  • Gaussian error term.
  • Homoskedastic error term.
  • Heteroskedastic error term.
  • Heteroskedastic and autocorrelated error term.

Gaussian Error Term

Theorem: Under the GM assumptions (1)-(5), $\hat{\beta} - \beta \,|\, X \sim N\big(0, \, \sigma^2 (X'X)^{-1}\big)$

Proof:
We follow 2 steps:

  1. We can rewrite $\hat{\beta}$ as $$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon = \beta + E_n[x_i x_i']^{-1} E_n[x_i \varepsilon_i]$$
  2. Therefore $\hat{\beta} - \beta = E_n[x_i x_i']^{-1} E_n[x_i \varepsilon_i]$. Conditional on $X$, $\hat{\beta} - \beta = (X'X)^{-1}X'\varepsilon$ is a linear combination of $\varepsilon \,|\, X \sim N(0, \sigma^2 I_n)$, so $$\hat{\beta} - \beta \,|\, X \sim N\big(0, \, \sigma^2 (X'X)^{-1}X'X(X'X)^{-1}\big) = N\big(0, \, \sigma^2 (X'X)^{-1}\big)$$

Does it make sense to assume that $\varepsilon$ is Gaussian? Not much. But does it make sense that $\hat{\beta}$ is Gaussian? Yes, because it's an average.
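As a quick illustration of this point (not part of the original derivation), the following sketch simulates the sampling distribution of a univariate OLS slope under a deliberately skewed error term; the DGP and all names are assumptions for illustration only. The skewness of the simulated $\hat{\beta}$ draws is close to zero: averaging washes out the skewness of $\varepsilon$.

using Random, Distributions, Statistics

Random.seed!(1);
n, reps = 500, 5_000;
draws = zeros(reps);
for r in 1:reps
    x = rand(Uniform(0, 1), n)
    ε = rand(Exponential(1), n) .- 1    # skewed, mean-zero errors
    y = 1.0 .* x .+ ε
    draws[r] = sum(x .* y) / sum(x.^2)  # univariate OLS slope
end
z = (draws .- mean(draws)) ./ std(draws);
println("skewness of β̂ ≈ ", mean(z.^3))  # ≈ 0: approximately Gaussian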

Homoskedastic Error Term

Theorem: Under the assumptions of the previous theorem, plus $E[\|x_i\|^4] < \infty$, the OLS estimator has an asymptotically normal distribution: $$\hat{\beta} \,|\, X \overset{d}{\to} N\big(\beta, \, \sigma^2 (X'X)^{-1}\big)$$

Proof: $$\sqrt{n}(\hat{\beta} - \beta) = \underbrace{E_n[x_i x_i']^{-1}}_{\overset{p}{\to} Q^{-1}} \, \underbrace{\sqrt{n}\, E_n[x_i \varepsilon_i]}_{\overset{d}{\to} N(0, \Omega)} \overset{d}{\to} N(0, \Sigma)$$ where in general $\Omega = \text{Var}(x_i \varepsilon_i) = E[x_i x_i' \varepsilon_i^2]$ and $\Sigma = Q^{-1} \Omega Q^{-1}$.

Given that $Q = E[x_i x_i']$ is unobserved, we estimate it with $\hat{Q} = E_n[x_i x_i']$. Since we have assumed a homoskedastic error term, $\Omega = \sigma^2 Q$, so that $\Sigma = \sigma^2 Q^{-1}$. Since we do not observe $\sigma^2$, we estimate it as $\hat{\sigma}^2 = E_n[\hat{\varepsilon}_i^2]$.

The terms $x_i \varepsilon_i$ are called scores, and we can already see their central importance for inference.
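As a sketch of how the sandwich variance is assembled from the estimated scores $x_i \hat{\varepsilon}_i$ (assuming `X`, `y`, `β_hat` and `n` as defined in the code section further below; this is an illustration, not the notes' own code):

using LinearAlgebra

ε_hat = y - X * β_hat;
scores = X .* ε_hat;                       # rows are the scores xᵢ'ε̂ᵢ
Q_hat = X'*X / n;                          # Q̂ = Eₙ[xᵢxᵢ']
Ω_hat = scores'*scores / n;                # Ω̂ = Eₙ[xᵢxᵢ'ε̂ᵢ²]
Σ_hat = inv(Q_hat) * Ω_hat * inv(Q_hat);   # Σ̂ = Q̂⁻¹Ω̂Q̂⁻¹
se = sqrt.(diag(Σ_hat) ./ n);              # standard errors of β̂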

Heteroskedastic Error Term

Assumption: $E[\varepsilon_i x_i \varepsilon_j x_j'] = 0$ for all $j \ne i$, and $E[\varepsilon_i^4] \le C$, $E[\|x_i\|^4] \le C < \infty$ a.s.

Theorem: Under GM assumptions (1)-(4) plus a heteroskedastic error term, the following estimators are consistent, i.e. $\hat{\Sigma} \overset{p}{\to} \Sigma$.

Note that we are only looking at the $\Omega$ part of the $\Sigma = Q^{-1} \Omega Q^{-1}$ matrix.

  • HC0: use the observed residuals $\hat{\varepsilon}_i$: $$\Omega^{HC0} = E_n[x_i x_i' \hat{\varepsilon}_i^2]$$ When $k$ is too big relative to $n$ (i.e., $k/n \to c > 0$), the $\hat{\varepsilon}_i^2$ are too small, so $\Omega^{HC0}$ is biased towards zero. $\Omega^{HC1}$, $\Omega^{HC2}$ and $\Omega^{HC3}$ try to correct this small-sample bias.
  • HC1: degrees-of-freedom correction (the default robust option in Stata): $$\Omega^{HC1} = \frac{n}{n-k} E_n[x_i x_i' \hat{\varepsilon}_i^2]$$
  • HC2: use standardized residuals: $$\Omega^{HC2} = E_n[x_i x_i' \hat{\varepsilon}_i^2 (1 - h_{ii})^{-1}]$$ where $h_{ii} = [X(X'X)^{-1}X']_{ii}$ is the leverage of the $i$th observation. A large $h_{ii}$ means that observation $i$ is unusual in the sense that the regressor $x_i$ is far from its sample mean.
  • HC3: use prediction errors, equivalent to the jackknife estimator $E_n[x_i x_i' \hat{\varepsilon}_{(i)}^2]$: $$\Omega^{HC3} = E_n[x_i x_i' \hat{\varepsilon}_i^2 (1 - h_{ii})^{-2}]$$ This estimator does not overfit when $k$ is relatively big with respect to $n$. Idea: you exclude the corresponding observation when estimating a particular $\varepsilon_i$, i.e. $\hat{\varepsilon}_{(i)} = y_i - x_i' \hat{\beta}_{(i)}$ (see the sketch after this list).
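A sketch verifying the jackknife interpretation of HC3: the standardized residual $\hat{\varepsilon}_i / (1 - h_{ii})$ coincides with the brute-force leave-one-out prediction error. The names `X`, `y`, `β_hat` are assumed from a fitted model (e.g. the code section below).

using LinearAlgebra

ε_hat = y - X * β_hat;
h = diag(X * inv(X'*X) * X');       # leverages hᵢᵢ
ε_loo = ε_hat ./ (1 .- h);          # closed-form leave-one-out residuals

# Brute-force check for observation 1: refit without it
idx = 2:size(X, 1);
β_m1 = (X[idx,:]'*X[idx,:]) \ (X[idx,:]'*y[idx]);
println(ε_loo[1] ≈ y[1] - X[1,:]'*β_m1)    # true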

HC0 Consistency

Theorem

Under regularity conditions HC0 is consistent, i.e. $\hat{\Sigma}^{HC0} \overset{p}{\to} \Sigma$: $$\hat{\Sigma} = \hat{Q}^{-1} \hat{\Omega} \hat{Q}^{-1} \overset{p}{\to} \Sigma \quad \text{with} \quad \hat{\Omega} = E_n[x_i x_i' \hat{\varepsilon}_i^2], \quad \hat{Q} = E_n[x_i x_i']$$

Why is the proof relevant? You cannot directly apply the WLLN to $\hat{\Sigma}$, because the summands $x_i x_i' \hat{\varepsilon}_i^2$ depend on $\hat{\beta}$ and hence are not i.i.d.

Proof

For the case $\dim(x_i) = 1$:

  1. $\hat{Q}^{-1} \overset{p}{\to} Q^{-1}$ by the WLLN and CMT, since $x_i$ is i.i.d. and $E[x_i^4] < \infty$.
  2. $\bar{\Omega} = E_n[\varepsilon_i^2 x_i^2] \overset{p}{\to} \Omega$ by the WLLN, since $E[\varepsilon_i^4] \le C$ and $x_i$ is bounded.
  3. By the triangle inequality, $|\hat{\Omega} - \Omega| \le \underbrace{|\Omega - \bar{\Omega}|}_{\overset{p}{\to} 0} + \underbrace{|\bar{\Omega} - \hat{\Omega}|}_{\text{WTS: } \overset{p}{\to} 0}$.
  4. We want to show $|\bar{\Omega} - \hat{\Omega}| \overset{p}{\to} 0$. By Cauchy–Schwarz, $$|\bar{\Omega} - \hat{\Omega}| = \big| E_n[\varepsilon_i^2 x_i^2] - E_n[\hat{\varepsilon}_i^2 x_i^2] \big| = \big| E_n[(\varepsilon_i^2 - \hat{\varepsilon}_i^2) x_i^2] \big| \le E_n[(\varepsilon_i^2 - \hat{\varepsilon}_i^2)^2]^{1/2} \, E_n[x_i^4]^{1/2}$$ where $E_n[x_i^4]^{1/2} \overset{p}{\to} E[x_i^4]^{1/2}$ since $x_i$ is bounded and i.i.d., and by the CMT.
  5. We want to show that $E_n[(\varepsilon_i^2 - \hat{\varepsilon}_i^2)^2] \le \eta$ with $\eta \overset{p}{\to} 0$. Let $L = \max_i |\hat{\varepsilon}_i - \varepsilon_i|$ (a random variable depending on $n$), with $L \overset{p}{\to} 0$ since $$|\hat{\varepsilon}_i - \varepsilon_i| = |x_i \hat{\beta} - x_i \beta| \le |x_i| \, |\hat{\beta} - \beta| \le c \, |\hat{\beta} - \beta| \overset{p}{\to} 0$$ We can decompose $$(\varepsilon_i^2 - \hat{\varepsilon}_i^2)^2 = (\varepsilon_i - \hat{\varepsilon}_i)^2 (\varepsilon_i + \hat{\varepsilon}_i)^2 \le (\varepsilon_i + \hat{\varepsilon}_i)^2 L^2 = \big(2\varepsilon_i - \varepsilon_i + \hat{\varepsilon}_i\big)^2 L^2 \le \big(2(2\varepsilon_i)^2 + 2(\hat{\varepsilon}_i - \varepsilon_i)^2\big) L^2 \le (8\varepsilon_i^2 + 2L^2) L^2$$ Hence $$E_n[(\varepsilon_i^2 - \hat{\varepsilon}_i^2)^2] \le L^2 \big(8 E_n[\varepsilon_i^2] + 2L^2\big) \overset{p}{\to} 0$$

Heteroskedastic and Autocorrelated Error Term

Assumption

There exists a $\bar{d}$ such that:

  • $E[\varepsilon_i x_i \varepsilon_{i-d} x_{i-d}'] \ne 0$ for $d \le \bar{d}$
  • $E[\varepsilon_i x_i \varepsilon_{i-d} x_{i-d}'] = 0$ for $d > \bar{d}$

Intuition: observations far enough from each other are not correlated.

We can express the variance of the score as $$\Omega_n = \text{Var}\big(\sqrt{n}\, E_n[x_i \varepsilon_i]\big) = E\left[\left(\frac{1}{\sqrt{n}} \sum_{i=1}^n x_i \varepsilon_i\right)\left(\frac{1}{\sqrt{n}} \sum_{j=1}^n x_j \varepsilon_j\right)'\right] = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n E[x_i \varepsilon_i x_j' \varepsilon_j] = \frac{1}{n} \sum_{i=1}^n \sum_{j: |i-j| \le \bar{d}} E[x_i \varepsilon_i x_j' \varepsilon_j] = \frac{1}{n} \sum_{d=0}^{\bar{d}} \sum_{i=d+1}^{n} E[x_i \varepsilon_i x_{i-d}' \varepsilon_{i-d}]$$

We estimate $\Omega_n$ by $$\hat{\Omega}_n = \frac{1}{n} \sum_{d=0}^{\bar{d}} \sum_{i=d+1}^{n} x_i \hat{\varepsilon}_i x_{i-d}' \hat{\varepsilon}_{i-d}$$

Theorem

If $\bar{d}$ is a fixed integer, then $\hat{\Omega}_n - \Omega_n \overset{p}{\to} 0$.

What if $\bar{d}$ does not exist (all $x_i \varepsilon_i$, $x_j \varepsilon_j$ are correlated)? Then $$\hat{\Omega}_n = \frac{1}{n} \sum_{d=0}^{n} \sum_{i=d+1}^{n} x_i \hat{\varepsilon}_i x_{i-d}' \hat{\varepsilon}_{i-d} = n \, E_n[x_i \hat{\varepsilon}_i] \, E_n[x_i \hat{\varepsilon}_i]' = 0$$ by the orthogonality property of the OLS residuals.

HAC with Uniform Kernel: $$\hat{\Omega}_h = \frac{1}{n} \sum_{i,j} x_i \hat{\varepsilon}_i x_j' \hat{\varepsilon}_j \, \mathbb{1}\{|i-j| \le h\}$$ where $h$ is the bandwidth of the kernel. The bandwidth is chosen such that $E[x_i \varepsilon_i x_{i-d}' \varepsilon_{i-d}]$ is small for $d > h$. How small? Small enough for the estimates to be consistent.

HAC with General Kernel: $$\hat{\Omega}^{HAC}_{k,h} = \frac{1}{n} \sum_{i,j} x_i \hat{\varepsilon}_i x_j' \hat{\varepsilon}_j \, k\!\left(\frac{|i-j|}{h}\right)$$
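Below is a minimal sketch of this estimator with a Bartlett kernel $k(z) = \max(1 - |z|, 0)$; the function name and signature are my own, and `X`, `ε_hat` and an integer bandwidth `h` are assumed given.

using LinearAlgebra

function hac_omega(X, ε_hat, h)
    n, k = size(X)
    s = X .* ε_hat                    # scores xᵢ'ε̂ᵢ as rows
    Ω = zeros(k, k)
    for d in -(h-1):(h-1)             # lags with nonzero Bartlett weight
        w = 1 - abs(d) / h            # kernel weight k(|i-j|/h)
        for i in max(1, 1-d):min(n, n-d)
            Ω += w * s[i,:] * s[i+d,:]' / n
        end
    end
    return Ω
end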

HAC Consistency

Theorem: If the joint distribution is stationary and $\alpha$-mixing with $\sum_{k=1}^{\infty} k^2 \alpha(k) < \infty$, and

  • $E[|x_{ij} \varepsilon_i|^\nu] < \infty$ for all $\nu$
  • $\hat{\varepsilon}_i = y_i - x_i' \hat{\beta}$ for some $\hat{\beta} \overset{p}{\to} \beta_0$
  • $k$ smooth, symmetric, with $k(z) \to 0$ as $z \to \infty$ and $\int k^2 < \infty$
  • $h/n \to 0$
  • $h \to \infty$

then the HAC estimator is consistent: $\hat{\Omega}^{HAC}_{k,h} - \Omega_n \overset{p}{\to} 0$.

Comments

We want to choose $h$ small relative to $n$ in order to avoid estimation problems. But we also want to choose $h$ large so that the remainder is small: $$\Omega_n = \text{Var}\big(\sqrt{n}\, E_n[x_i \varepsilon_i]\big) = \underbrace{\frac{1}{n} \sum_{i,j: |i-j| \le h} E[x_i \varepsilon_i x_j' \varepsilon_j]}_{\Omega_n^h} + \underbrace{\frac{1}{n} \sum_{i,j: |i-j| > h} E[x_i \varepsilon_i x_j' \varepsilon_j]}_{\text{remainder: } R_n} = \Omega_n^h + R_n$$

In particular, HAC theory requires $\hat{\Omega}^{HAC} \overset{p}{\to} \Omega$ if $h/n \to 0$ and $h \to \infty$.

But in practice, long-run estimation implies $h/n \to 0$, which is not "safe" in the sense that it does not imply $R_n \to 0$. On the other hand, if $h/n \not\to 0$, $\hat{\Omega}^{HAC}$ does not converge in probability because it is too noisy.

Choice of h

How to choose h? Look at the score autocorrelation function (ACF).

Autocorrelation Function

[Figure: empirical autocorrelation function of the scores]

It looks like after 10 periods the empirical autocorrelation is quite small, but still not zero.
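The following sketch computes such an empirical score autocorrelation for the univariate case; `x` and `ε_hat` are assumed from a previously fitted time-series regression (illustrative names, not the notes' code).

using Statistics

s = x .* ε_hat;                                  # scores of a univariate regression
acf(s, d) = cor(s[1+d:end], s[1:end-d]);         # empirical autocorrelation at lag d
for d in 1:15
    println("lag $d: ", round(acf(s, d), digits = 3))
end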

Fixed b Asymptotics

[Neave, 1970]: "When proving results on the asymptotic behavior of estimates of the spectrum of a stationary time series, it is invariably assumed that as the sample size $n$ tends to infinity, so does the truncation point $h$, but at a slower rate, so that $h/n$ tends to zero. This is a convenient assumption mathematically in that, in particular, it ensures consistency of the estimates, but it is unrealistic when such results are used as approximations to the finite case where the value of $h/n$ cannot be zero."

Fixed b Theorem

Theorem

Under regularity conditions, $\sqrt{n}\, \big(\hat{V}^{HAC}_{k,h}\big)^{-1/2} (\hat{\beta} - \beta_0) \overset{d}{\to} F$, where $F$ is a non-standard limiting distribution.

The asymptotic critical values of the $F$ statistic depend on the choice of the kernel. In order to do hypothesis testing, Kiefer and Vogelsang (2005) provide critical value functions for the t-statistic for each kernel-confidence level combination using a cubic equation: $$cv(b) = a_0 + a_1 b + a_2 b^2 + a_3 b^3$$
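As a sketch, evaluating such a critical value function is a one-liner; the coefficients below are placeholders for illustration only, NOT the published Kiefer–Vogelsang values.

# Hypothetical coefficients (a₀, a₁, a₂, a₃) -- illustrative, not published values
a = (1.96, 2.9, 0.4, 1.4);
cv(b) = a[1] + a[2]*b + a[3]*b^2 + a[4]*b^3;

b = h / n;            # b = bandwidth / sample size (h, n assumed given)
println(cv(b))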

Example

Example for the Bartlett kernel:

[Figure: fixed-b critical values for the Bartlett kernel]

Fixed G Asymptotics

[Bester, 2013]: "Cluster covariance estimators are routinely used with data that has a group structure with independence assumed across groups. Typically, inference is conducted in such settings under the assumption that there are a large number of these independent groups."

"However, with enough weakly dependent data, we show that groups can be chosen by the researcher so that group-level averages are approximately independent. Intuitively, if groups are large enough and well shaped (e.g. do not have gaps), the majority of points in a group will be far from other groups, and hence approximately independent of observations from other groups provided the data are weakly dependent. The key prerequisite for our methods is the researcher's ability to construct groups whose averages are approximately independent. As we show later, this often requires that the number of groups be kept relatively small, which is why our main results explicitly consider a fixed (small) number of groups."

Assumption

Suppose you have data $D = (y_{it}, x_{it})_{i=1,t=1}^{N,T}$ where $$y_{it} = x_{it}' \beta + \alpha_i + \varepsilon_{it}$$ and $i$ indexes the observational unit while $t$ indexes time (it could also be space).

Let $$\tilde{y}_{it} = y_{it} - \frac{1}{T} \sum_{t=1}^T y_{it}, \qquad \tilde{x}_{it} = x_{it} - \frac{1}{T} \sum_{t=1}^T x_{it}, \qquad \tilde{\varepsilon}_{it} = \varepsilon_{it} - \frac{1}{T} \sum_{t=1}^T \varepsilon_{it}$$ Then $\tilde{y}_{it} = \tilde{x}_{it}' \beta + \tilde{\varepsilon}_{it}$.

The $\tilde{\varepsilon}_{it}$ are correlated with each other by construction, even if the original $\varepsilon_{it}$ were i.i.d. The cluster score variance estimator is given by: $$\hat{\Omega}^{CL} = \frac{1}{T-1} \sum_{i=1}^n \sum_{t=1}^T \sum_{s=1}^T \tilde{x}_{it} \hat{\tilde{\varepsilon}}_{it} \tilde{x}_{is}' \hat{\tilde{\varepsilon}}_{is}$$

It is very similar to the HAC estimator, since we have dependent cross-products here as well. However, here we do not consider the $i \times j$ cross-products: we only allow for dependence over time (or space) within a unit.

Comments (1)

On T and n:

  • If $T$ is fixed and $n \to \infty$, then the number of cross-products considered is much smaller than the total number of cross-products.
  • If $T \gg n$, issues arise, since the number of cross-products considered is close to the total number of cross-products. As in HAC estimation, this is a problem because it implies that the algebraic estimate of the cluster score variance gets close to zero, because of the orthogonality property of the residuals.
  • The panel assumption is that observations across individuals are not correlated.

Strategy: as in HAC, we want to limit the correlation across clusters (individuals). We hope that observations are negligibly dependent between clusters sufficiently distant from each other.

Comments (2)

Classical cluster-robust estimator: $$\hat{\Omega}^{CL} = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n x_i \hat{\varepsilon}_i x_j' \hat{\varepsilon}_j \, \mathbb{1}\{i, j \text{ in the same cluster}\}$$

On clusters:

  • If the number of observations near a boundary is small relative to the sample size, ignoring the dependence should not affect inference too adversely.
  • The higher the dimension of the data, the easier it is to have observations near boundaries (curse of dimensionality).
  • We would like to have few clusters in order to make fewer independence assumptions. However, fewer clusters means bigger blocks and hence a larger number of cross-products to estimate. If the number of cross-products is too large (relative to the sample size), $\hat{\Omega}^{CL}$ does not converge.

Theorem: Under regularity conditions: $$\hat{t} \overset{d}{\to} \sqrt{\frac{G}{G-1}} \, t_{G-1}$$ where $G$ is the number of clusters (see the sketch below).
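Here is a sketch of the classical cluster-robust variance together with the fixed-$G$ correction from the theorem. The names `X`, `y`, `β_hat` and a hypothetical vector `g` of cluster labels in 1..G are assumptions for illustration, not the notes' own code.

using Distributions, LinearAlgebra

ε_hat = y - X * β_hat;
n, k = size(X);
G = maximum(g);                            # g: assumed cluster labels 1..G
Ω_cl = sum(1:G) do c
    sc = vec(sum((X .* ε_hat)[g .== c, :], dims = 1))  # within-cluster score sum
    sc * sc' / n
end
Σ_cl = inv(X'*X / n) * Ω_cl * inv(X'*X / n);
se_cl = sqrt.(diag(Σ_cl) ./ n);

# Fixed-G inference: scale the t₍G₋₁₎ critical value by √(G/(G-1))
crit = sqrt(G / (G - 1)) * quantile(TDist(G - 1), 0.975);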

Code - DGP

This code draws 100 observations from the model $y = 2 x_1 - x_2 + \varepsilon$, where $x_1, x_2 \sim U[0,1]$ and $\varepsilon \sim N(0,1)$.

# Load packages
using Random, Distributions, LinearAlgebra, Statistics

# Set seed
Random.seed!(123);

# Set the number of observations
n = 100;

# Set the dimension of X
k = 2;

# Draw a sample of explanatory variables
X = rand(Uniform(0,1), n, k);

# Draw the error term
σ = 1;
ε = rand(Normal(0,1), n, 1) * sqrt(σ);

# Set the parameters
β = [2; -1];

# Calculate the dependent variable
y = X*β + ε;

Ideal Estimate

# OLS estimator
β_hat = (X'*X)\(X'*y);

# Residuals
ε_hat = y - X*β_hat;

# Homoskedastic variance matrix and standard errors
var_h = var(ε_hat) * inv(X'*X);
std_h = sqrt.(diag(var_h));

# Projection matrix
P = X * inv(X'*X) * X';

# Leverage
h = diag(P);

HC Estimates

# HC0 variance and standard errors
Ω_hc0 = X' * (I(n) .* ε_hat.^2) * X;
std_hc0 = sqrt.(diag(inv(X'*X) * Ω_hc0 * inv(X'*X)))
## 2-element Array{Float64,1}:
##  0.24691300271914793
##  0.28044707935951835
# HC1 variance and standard errors
Ω_hc1 = n/(n-k) * X' * (I(n) .* ε_hat.^2) * X;
std_hc1 = sqrt.(diag(inv(X'*X) * Ω_hc1 * inv(X'*X)))
## 2-element Array{Float64,1}:
##  0.24941979797977423
##  0.2832943308272532
# HC2 variance and standard errors
Ω_hc2 = X' * (I(n) .* ε_hat.^2 ./ (1 .- h)) * X;
std_hc2 = sqrt.(diag(inv(X'*X) * Ω_hc2 * inv(X'*X)))
## 2-element Array{Float64,1}:
##  0.2506509902982869
##  0.2850878737103963
# HC3 variance and standard errors
Ω_hc3 = X' * (I(n) .* ε_hat.^2 ./ (1 .- h).^2) * X;
std_hc3 = sqrt.(diag(inv(X'*X) * Ω_hc3 * inv(X'*X)))
## 2-element Array{Float64,1}:
##  0.25446321015850176
##  0.2898264779289438
# Note what happens if you allow for full autocorrelation:
# X'*ε_hat ≈ 0 by the orthogonality of the OLS residuals, so Ω collapses to zero
omega_full = X'*ε_hat*ε_hat'*X;

Inference

Hypothesis Testing

In order to do inference on $\hat{\beta}$ we need to know its distribution. We have two options: (i) assume a Gaussian error term (extended GM assumptions), or (ii) rely on asymptotic approximations (CLT).

A statistical hypothesis is a subset of a statistical model, $K \subseteq F$. A hypothesis test is a map $T: D \to \{0, 1\}$. If $F$ is the statistical model and $K$ is the statistical hypothesis, we use the notation $H_0: \Pr \in K$.

Generally, we are interested in understanding whether it is likely that data D are drawn from K or not.

A hypothesis test $T$ is our tool for deciding whether the hypothesis is consistent with the data: $T(D) = 0$ means we fail to reject $H_0$ (the test is inconclusive), while $T(D) = 1$ means we reject $H_0$, i.e. $D$ is inconsistent with any $\Pr \in K$.

Let $K \subseteq F$ be a statistical hypothesis and $T$ a hypothesis test.

  1. Suppose $\Pr \in K$. A Type I error (relative to $\Pr$) is the event $T(D) = 1$ under $\Pr$.
  2. Suppose $\Pr \in K^c$. A Type II error (relative to $\Pr$) is the event $T(D) = 0$ under $\Pr$.

The probability of a Type I error is called the size of the test. One minus the probability of a Type II error is called the power (against the alternative $\Pr$).

In this section, we are interested in testing three hypotheses under the assumptions of linearity, strict exogeneity, no multicollinearity, and normality of the error term. They are:

  1. $H_0: \beta_{0k} = \bar{\beta}_{0k}$ (single coefficient, $\bar{\beta}_{0k} \in \mathbb{R}$, $k \le K$)
  2. $H_0: a'\beta_0 = c$ (linear combination, $a \in \mathbb{R}^K$, $c \in \mathbb{R}$)
  3. $H_0: R\beta_0 = r$ (linear restrictions, $R \in \mathbb{R}^{p \times K}$ with full rank, $r \in \mathbb{R}^p$)

Testing Problem

Consider the testing problem $H_0: \beta_{0k} = \bar{\beta}_{0k}$, where $\bar{\beta}_{0k}$ is a pre-specified value under the null. The t-statistic for this problem is defined by $$t_k := \frac{b_k - \bar{\beta}_{0k}}{SE(b_k)}, \qquad SE(b_k) := \sqrt{s^2 \, [(X'X)^{-1}]_{kk}}$$

Theorem: In the testing procedure above, the sampling distribution under the null $H_0$ is given by $$t_k \,|\, X \sim t_{n-K} \quad \text{and so} \quad t_k \sim t_{n-K}$$

Here $t_{n-K}$ denotes the t-distribution with $n-K$ degrees of freedom. The test can be one-sided or two-sided. The above sampling distribution can be used to construct a confidence interval.
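A sketch of this exact t test in Julia, using the $t_{n-K}$ reference distribution rather than the normal approximation used in the code section at the end; `X`, `y`, `n`, `k` are assumed as in that section, and the tested coefficient and hypothesized value are illustrative.

using Distributions, LinearAlgebra

b = (X'*X) \ (X'*y);                       # OLS estimate
s2 = sum(abs2, y - X*b) / (n - k);         # s², the error variance estimate
j = 1;                                     # coefficient under test (illustrative)
β_bar = 0.0;                               # hypothesized value (illustrative)
t_j = (b[j] - β_bar) / sqrt(s2 * inv(X'*X)[j, j]);
p_val = 2 * (1 - cdf(TDist(n - k), abs(t_j)));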

Example

We want to assess whether or not the "true" coefficient $\beta_0$ equals a specific value $\bar{\beta}$. Specifically, we are interested in testing $H_0$ against $H_1$, where:

  • Null hypothesis: $H_0: \beta_0 = \bar{\beta}$
  • Alternative hypothesis: $H_1: \beta_0 \ne \bar{\beta}$

Hence, we are interested in a statistic informative about $H_1$, which is the Wald test statistic $$|T| = \left| \frac{\hat{\beta} - \bar{\beta}}{\sigma(\hat{\beta})} \right|, \qquad T \sim N(0,1) \text{ under } H_0$$

However, the true variance $\sigma^2(\hat{\beta})$ is not known and has to be estimated. Therefore we plug in the sample variance $\hat{\sigma}^2 = \frac{n}{n-1} E_n[\hat{e}_i^2]$ and use $$|T| = \left| \frac{\hat{\beta} - \bar{\beta}}{\hat{\sigma}(\hat{\beta})} \right|, \qquad T \sim t_{n-K} \text{ under } H_0$$

Comments

Hypothesis testing is like proof by contradiction: imagine the sampling distribution was generated by $\bar{\beta}$. If it is highly improbable to observe $\hat{\beta}$ under $\beta_0 = \bar{\beta}$, then we reject the hypothesis that the sampling distribution was generated by $\bar{\beta}$.

Then, given a realized value of the statistic |T|, we take the following decision:

  • Do not reject $H_0$: $|T| \le c$, i.e. the realized value is consistent with random variation under a true $H_0$, since $|T|$ has an exact Student t distribution with $n-K$ degrees of freedom in the normal regression model.
  • Reject $H_0$ in favor of $H_1$: $|T| > c$, with $c$ the critical value selected to control false rejections: $\Pr(|t_{n-K}| \ge c) = \alpha$. Equivalently, you can reject $H_0$ if the p-value $p$ satisfies $p < \alpha$.

Comments (2)

The probability of false rejection is decreasing in $c$, the critical value for a given significance level: $$\Pr(\text{Reject } H_0 \,|\, H_0) = \Pr(|T| > c \,|\, H_0) = \Pr(T > c \,|\, H_0) + \Pr(T < -c \,|\, H_0) = 1 - F(c) + F(-c) = 2\big(1 - F(c)\big)$$
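For instance, with a standard normal reference distribution, the relation $2(1 - F(c)) = \alpha$ pins down the familiar critical value; a one-line check:

using Distributions

α = 0.05;
c = quantile(Normal(0, 1), 1 - α/2);          # ≈ 1.96
println(2 * (1 - cdf(Normal(0, 1), c)))       # recovers α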

Example: Consider the testing problem $H_0: a'\beta_0 = c$, where $a$ is a pre-specified linear combination under study. The t-statistic for this problem is defined by: $$t_a := \frac{a'b - c}{SE(a'b)}, \qquad SE(a'b) := \sqrt{s^2 \, a'(X'X)^{-1}a}$$

t Stat

Theorem

In the testing procedure above, the sampling distribution under the null $H_0$ is given by $$t_a \,|\, X \sim t_{n-K} \quad \text{and so} \quad t_a \sim t_{n-K}$$

As in the previous test, $t_{n-K}$ denotes the t-distribution with $n-K$ degrees of freedom. The test can again be one-sided or two-sided. The above sampling distribution can be used to construct a confidence interval.

F Stat

Example

Consider the testing problem $H_0: R\beta_0 = r$, where $R \in \mathbb{R}^{p \times K}$ is a pre-specified set of linear combinations and $r \in \mathbb{R}^p$ is a restriction vector.

The F-statistic for this problem is given by $$F := \frac{(Rb - r)' \big[R(X'X)^{-1}R'\big]^{-1} (Rb - r) / p}{s^2}$$

Theorem

For this problem, the sampling distribution of the F-statistic under the null $H_0$ is $$F \,|\, X \sim F_{p, n-K} \quad \text{and so} \quad F \sim F_{p, n-K}$$

The test is intrinsically two-sided. The above sampling distribution can be used to construct a confidence region.

Equivalence

Theorem

Consider the testing problem $H_0: R\beta_0 = r$, where $R \in \mathbb{R}^{p \times K}$ is a pre-specified set of linear combinations and $r \in \mathbb{R}^p$ is a restriction vector.

Consider the restricted least squares estimator, denoted $\hat{\beta}^R$: $$\hat{\beta}^R := \arg\min_{\beta: R\beta = r} Q(\beta)$$ Let $SSR_U = Q(b)$ and $SSR_R = Q(\hat{\beta}^R)$. Then the F statistic is numerically equivalent to the following expression (see the sketch below): $$F = \frac{(SSR_R - SSR_U)/p}{SSR_U/(n-K)}$$
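A sketch verifying this numerical equivalence for the special case $H_0: \beta = 0$ (so $R = I$, $r = 0$, $p = k$); `X`, `y`, `n`, `k` are assumed as in the code section below. Since $X'\hat{\varepsilon} = 0$, we have $SSR_R - SSR_U = b'X'Xb$, and the two forms coincide exactly.

using LinearAlgebra

b = (X'*X) \ (X'*y);                       # unrestricted OLS
ε_hat = y - X*b;
s2 = sum(abs2, ε_hat) / (n - k);
SSR_u = sum(abs2, ε_hat);
SSR_r = sum(abs2, y);                      # restricted SSR under β = 0
F_wald = only(b' * (X'*X) * b) / k / s2;   # Wald form with R = I, r = 0
F_ssr = ((SSR_r - SSR_u) / k) / (SSR_u / (n - k));
println(F_wald ≈ F_ssr)                    # true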

Confidence Intervals

A confidence interval at level $(1-\alpha)$ is a random set $C$ such that $\Pr(\beta_0 \in C) \ge 1 - \alpha$, i.e. the probability that $C$ covers the true value $\beta_0$ is at least $(1-\alpha)$.

Since C is not known, it has to be estimated (C^). We construct confidence intervals such that:

  • they are symmetric around $\hat{\beta}$;
  • their length is proportional to $\hat{\sigma}(\hat{\beta}) = \sqrt{\widehat{\text{Var}}(\hat{\beta})}$.

A CI is equivalent to the set of parameter values such that the t-statistic is smaller than $c$ in absolute value, i.e., $$\hat{C} = \{\beta : |T(\beta)| \le c\} = \left\{\beta : -c \le \frac{\beta - \hat{\beta}}{\hat{\sigma}(\hat{\beta})} \le c \right\}$$

In practice, to construct a 95% confidence interval for a single coefficient estimate $\hat{\beta}_j$, we use the fact that $$\Pr\left( \frac{|\hat{\beta}_j - \beta_{0,j}|}{\sqrt{\sigma^2 [(X'X)^{-1}]_{jj}}} > 1.96 \right) = 0.05$$

Code

# t-test for beta=0
t = abs.(β_hat ./ (std_hc1));

# p-value (two-sided): 2(1 - Φ(|t|)), as derived above
p_val = 2 .* (1 .- cdf.(Normal(0,1), t));

# F statistic of joint significance (H0: β = 0)
SSR_u = ε_hat'*ε_hat;
SSR_r = y'*y;    # restricted SSR under H0 (no regressors)
F = (SSR_r - SSR_u)/k / (SSR_u/(n-k));

# 95% confidence intervals
conf_int = [β_hat - 1.96*std_hc1, β_hat + 1.96*std_hc1];