OLS Inference
Last updated on Oct 29, 2021
Asymptotic Theory of the OLS Estimator
OLS Consistency
Theorem: Assume that (i) the observations $(y_i, x_i)$ are iid, (ii) $E[x_i x_i']$ is finite and positive definite, (iii) $E[x_i \varepsilon_i] = 0$, and (iv) $x_i$ and $\varepsilon_i$ have finite second moments. Then $\hat{\beta}_{OLS} \overset{p}{\to} \beta$.
Proof:
We consider 4 steps:
1. $\frac{1}{n} \sum_i x_i x_i' \overset{p}{\to} E[x_i x_i']$ by the WLLN, since the observations are iid and $E[x_i x_i']$ is finite.
2. $\frac{1}{n} \sum_i x_i \varepsilon_i \overset{p}{\to} E[x_i \varepsilon_i] = 0$ by the WLLN, due to iid sampling, Cauchy-Schwarz and the finite second moments of $x_i$ and $\varepsilon_i$.
3. $\big(\frac{1}{n} \sum_i x_i x_i'\big)^{-1} \overset{p}{\to} E[x_i x_i']^{-1}$ by the CMT.
4. $\hat{\beta}_{OLS} = \beta + \big(\frac{1}{n} \sum_i x_i x_i'\big)^{-1} \frac{1}{n} \sum_i x_i \varepsilon_i \overset{p}{\to} \beta$ by the CMT.
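To see consistency at work, here is a minimal Monte Carlo sketch (not part of the original notes; the univariate design and sample sizes are purely illustrative): as $n$ grows, the OLS estimate concentrates around the true coefficient.

# Consistency check: the OLS estimate approaches the true coefficient as n grows
using Random, Distributions
Random.seed!(1);
β_true = 0.5;
for n in (100, 10_000, 1_000_000)
    x = rand(Normal(0,1), n);       # regressor
    ε = rand(Normal(0,1), n);       # error term, independent of x
    y = β_true .* x .+ ε;
    β_n = (x'x) \ (x'y);            # scalar OLS estimate
    println("n = $n: β_hat = $β_n")
end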
Variance and Assumptions
Now we are going to investigate the variance of $\hat{\beta}_{OLS}$ under different assumptions on the error term:
- Gaussian error term.
- Homoskedastic error term.
- Heteroskedastic error term.
- Heteroskedastic and autocorrelated error term.
Gaussian Error Term
Theorem: Under the GM assumptions (1)-(5), $\hat{\beta}_{OLS} \mid X \sim N\big(\beta, \sigma^2 (X'X)^{-1}\big)$.
Proof:
We follow 2 steps:
- We can rewrite $\hat{\beta}_{OLS}$ as $\hat{\beta}_{OLS} = \beta + (X'X)^{-1} X' \varepsilon$.
- Therefore, conditional on $X$, $\hat{\beta}_{OLS}$ is a linear function of the Gaussian vector $\varepsilon$ and hence is itself Gaussian, with mean $\beta$ and variance $(X'X)^{-1} X' \operatorname{Var}(\varepsilon \mid X) X (X'X)^{-1} = \sigma^2 (X'X)^{-1}$.
Does it make sense to assume that $\varepsilon$ is Gaussian? Not much. But does it make sense that $\hat{\beta}_{OLS}$ is (approximately) Gaussian? Yes, because it is an average.
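A quick Monte Carlo illustration of this point (a sketch, not part of the original notes; the exponential error is an arbitrary non-Gaussian choice and the sample size and number of replications are illustrative): even with a skewed error term, the sampling distribution of the OLS estimate is approximately normal.

# Sampling distribution of the OLS estimate with a non-Gaussian error term
using Random, Distributions, Statistics
Random.seed!(2);
n, reps = 200, 5_000;
β_draws = zeros(reps);
for r in 1:reps
    x = rand(Uniform(0,1), n);
    ε = rand(Exponential(1), n) .- 1;   # skewed error with mean zero
    y = 2 .* x .+ ε;
    β_draws[r] = (x'x) \ (x'y);         # scalar OLS estimate
end
# The draws center around the true value 2; their histogram is approximately bell-shaped
println(mean(β_draws), " ", std(β_draws))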
Homoskedastic Error Term
Theorem: Under the assumptions of the previous theorem, plus homoskedastic errors, $E[\varepsilon_i^2 \mid x_i] = \sigma^2$, the conditional variance of the OLS estimator is $\operatorname{Var}(\hat{\beta}_{OLS} \mid X) = \sigma^2 (X'X)^{-1}$, which can be estimated by $\hat{V} = \hat{\sigma}^2 (X'X)^{-1}$.
Proof:
Since we have assumed a homoskedastic error term, we have $\operatorname{Var}(\hat{\beta}_{OLS} \mid X) = (X'X)^{-1} X' \sigma^2 I_n X (X'X)^{-1} = \sigma^2 (X'X)^{-1}$. Given that $\sigma^2$ is unobserved, we estimate it with $\hat{\sigma}^2 = \frac{1}{n-k} \sum_i \hat{\varepsilon}_i^2$. Since we do not observe $\varepsilon_i$, we estimate it with the OLS residual $\hat{\varepsilon}_i = y_i - x_i' \hat{\beta}_{OLS}$. The division by $n - k$ accounts for the degrees of freedom used in estimating $\beta$.
Heteroskedastic Error Term
Assumption: heteroskedastic error term, i.e. $E[\varepsilon_i^2 \mid x_i] = \sigma^2(x_i)$ may depend on $x_i$.
Theorem: Under GM assumptions (1)-(4) plus a heteroskedastic error term, the following estimators of $\Omega = E[x_i x_i' \varepsilon_i^2]$ are consistent, i.e. $\hat{\Omega} \overset{p}{\to} \Omega$, so that the sandwich variance $V = E[x_i x_i']^{-1} \, \Omega \, E[x_i x_i']^{-1}$ can be estimated by plugging $\hat{\Omega}$ into the sandwich formula.
Note that we are only looking at the inner part $\Omega$ of the sandwich matrix.
- HC0: use the observed residuals,
  $$\hat{\Omega}_{HC0} = \frac{1}{n} \sum_i x_i x_i' \hat{\varepsilon}_i^2 .$$
  When $k$ is big relative to $n$ — i.e., $k/n$ is not negligible — the residuals $\hat{\varepsilon}_i$ are too small ($\hat{\Omega}_{HC0}$ is biased towards zero). HC1, HC2 and HC3 try to correct this small-sample bias.
- HC1: degree of freedom correction (the default `robust` option in Stata),
  $$\hat{\Omega}_{HC1} = \frac{n}{n-k} \, \hat{\Omega}_{HC0} .$$
- HC2: use standardized residuals,
  $$\hat{\Omega}_{HC2} = \frac{1}{n} \sum_i x_i x_i' \frac{\hat{\varepsilon}_i^2}{1 - h_i},$$
  where $h_i = x_i'(X'X)^{-1} x_i$ is the leverage of observation $i$. A large $h_i$ means that observation $i$ is unusual in the sense that the regressor $x_i$ is far from its sample mean.
- HC3: use prediction errors, equivalent to the jackknife estimator, i.e.,
  $$\hat{\Omega}_{HC3} = \frac{1}{n} \sum_i x_i x_i' \frac{\hat{\varepsilon}_i^2}{(1 - h_i)^2} .$$
  This estimator does not overfit when $k$ is relatively big with respect to $n$. Idea: you exclude the corresponding observation when estimating a particular residual: $\tilde{\varepsilon}_i = y_i - x_i' \hat{\beta}_{(-i)} = \hat{\varepsilon}_i / (1 - h_i)$. A numerical check of this identity is sketched below.
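As a sanity check on the jackknife interpretation of HC3, here is a small self-contained sketch (not part of the original notes; the simulated design and the observation index are arbitrary) verifying numerically that the leave-one-out prediction error $y_i - x_i'\hat{\beta}_{(-i)}$ equals $\hat{\varepsilon}_i / (1 - h_i)$.

# Check: leave-one-out prediction error equals ε̂_i / (1 - h_i)
using Random, Distributions, LinearAlgebra
Random.seed!(3);
n, k = 50, 2;
X = rand(Uniform(0,1), n, k);
y = X*[2; -1] + rand(Normal(0,1), n);
β_hat = (X'*X)\(X'*y);
ε_hat = y - X*β_hat;
h = diag(X*inv(X'*X)*X');                            # leverages
i = 1;                                               # check the first observation
idx = setdiff(1:n, i);
β_loo = (X[idx,:]'*X[idx,:]) \ (X[idx,:]'*y[idx]);   # OLS without observation i
println(y[i] - X[i,:]'*β_loo ≈ ε_hat[i]/(1 - h[i]))  # prints true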
HC0 Consistency
Theorem
Under regularity conditions HC0 is consistent,
i.e. $\hat{\Omega}_{HC0} \overset{p}{\to} \Omega = E[x_i x_i' \varepsilon_i^2]$.
Why is the proof not immediate? You cannot directly apply the WLLN to $\frac{1}{n} \sum_i x_i x_i' \hat{\varepsilon}_i^2$, since the residuals $\hat{\varepsilon}_i$ depend on $\hat{\beta}$ and hence the summands are not iid.
Proof
For the scalar case $k = 1$, decompose $\hat{\varepsilon}_i = \varepsilon_i - x_i (\hat{\beta} - \beta)$, so that
$$\frac{1}{n} \sum_i x_i^2 \hat{\varepsilon}_i^2 = \frac{1}{n} \sum_i x_i^2 \varepsilon_i^2 - 2 (\hat{\beta} - \beta) \frac{1}{n} \sum_i x_i^3 \varepsilon_i + (\hat{\beta} - \beta)^2 \frac{1}{n} \sum_i x_i^4 .$$
- $\frac{1}{n} \sum_i x_i^2 \varepsilon_i^2 \overset{p}{\to} E[x_i^2 \varepsilon_i^2] = \Omega$ by the WLLN, since $x_i^2 \varepsilon_i^2$ is iid with finite mean.
- $\frac{1}{n} \sum_i x_i^3 \varepsilon_i$ and $\frac{1}{n} \sum_i x_i^4$ converge to finite limits by the WLLN, since the observations are iid and the relevant moments are bounded.
- $\hat{\beta} - \beta \overset{p}{\to} 0$ by consistency of OLS, so by the CMT the last two terms converge to zero.
By the triangle inequality, $\hat{\Omega}_{HC0} - \frac{1}{n} \sum_i x_i^2 \varepsilon_i^2 \overset{p}{\to} 0$, and hence $\hat{\Omega}_{HC0} \overset{p}{\to} \Omega$.
Heteroskedastic and Autocorrelated Error Term
Assumption
There exists an $h \geq 0$ such that $E[x_i \varepsilon_i \varepsilon_j x_j'] \neq 0$ is allowed for $|i - j| \leq h$, and $E[x_i \varepsilon_i \varepsilon_j x_j'] = 0$ for $|i - j| > h$.
Intuition: observations far enough from each other are not correlated.
We can express the variance of the score as
$$\Omega = \operatorname{Var}\Big( \frac{1}{\sqrt{n}} \sum_i x_i \varepsilon_i \Big) = \frac{1}{n} \sum_i \sum_j E[x_i \varepsilon_i \varepsilon_j x_j'] = \frac{1}{n} \sum_i \sum_{j:\, |i-j| \leq h} E[x_i \varepsilon_i \varepsilon_j x_j'] .$$
We estimate it by replacing expectations with sample analogues and the errors with the OLS residuals:
$$\hat{\Omega} = \frac{1}{n} \sum_i \sum_{j:\, |i-j| \leq h} x_i \hat{\varepsilon}_i \hat{\varepsilon}_j x_j' .$$
Theorem
If $h$ is finite and known, then under regularity conditions $\hat{\Omega} \overset{p}{\to} \Omega$.
What if $h$ does not exist (all observations are correlated with each other)? Then including all cross-products yields an estimator that is identically zero, since $\sum_i x_i \hat{\varepsilon}_i = X'\hat{\varepsilon} = 0$ by the orthogonality property of the OLS residuals.
HAC with Uniform Kernel
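A minimal Julia sketch of the truncated (uniform-kernel) HAC estimator of $\Omega$, assuming the sample is ordered in time (not part of the original notes; the function name and the bandwidth argument are illustrative):

# Truncated (uniform kernel) HAC estimate of Ω, the variance of the score
using LinearAlgebra
function hac_uniform(X, ε_hat, h)
    n = size(X, 1);
    s = X .* ε_hat;                               # scores x_i * ε̂_i, one row per observation
    Ω = (s'*s) / n;                               # lag-0 term
    for l in 1:h
        Γ = (s[l+1:n, :]' * s[1:n-l, :]) / n;     # lag-l sample autocovariance of the score
        Ω += Γ + Γ';                              # uniform kernel: weight 1 for all lags |l| ≤ h
    end
    return Ω
end
# Example usage with the data generated in the code section below (h = 3 is arbitrary):
# Ω_hac = hac_uniform(X, ε_hat, 3);
# V_hac = inv(X'*X) * (n * Ω_hac) * inv(X'*X);    # sandwich form, as in the HC code below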
HAC with General Kernel
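With a general kernel, the same sketch only changes the weights on each lag. Below is a hedged Newey–West style version using the Bartlett weights $1 - l/(h+1)$ (again, the function name is illustrative and not part of the original notes):

# Bartlett-kernel (Newey-West) HAC estimate of Ω
using LinearAlgebra
function hac_bartlett(X, ε_hat, h)
    n = size(X, 1);
    s = X .* ε_hat;                               # scores
    Ω = (s'*s) / n;
    for l in 1:h
        w = 1 - l/(h + 1);                        # Bartlett weight, declining linearly in the lag
        Γ = (s[l+1:n, :]' * s[1:n-l, :]) / n;
        Ω += w * (Γ + Γ');
    end
    return Ω
end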
HAC Consistency
Theorem If the joint distribution of the data is stationary, the kernel $k(\cdot)$ is smooth and symmetric with $k(0) = 1$, and $h \to \infty$ with $h/n \to 0$ as $n \to \infty$,
then the HAC estimator is consistent.
Comments
We want to choose the bandwidth $h$.
In particular, HAC theory requires $h \to \infty$ and $h/n \to 0$ as $n \to \infty$.
But in practice, long-run variance estimation implies picking a specific $h$ for the given sample, so that $h/n$ is a fixed, non-negligible fraction.
Choice of h
How to choose $h$?
It looks like after 10 periods the empirical autocorrelation is quite small but still not zero.
Fixed b Asymptotics
[Neave, 1970]: “When proving results on the asymptotic behaviour of estimates of the spectrum of a stationary time series, it is invariably assumed that as the sample size $n$ tends to infinity, so does the truncation point $h$, but at a slower rate, so that $h/n$ tends to zero. This is a convenient assumption mathematically in that, in particular, it ensures consistency of the estimates, but it is unrealistic when such results are used as approximations to the finite case where $h/n$ cannot be zero.”
Fixed b Theorem
Theorem
Under regularity conditions, if $b = h/n$ is held fixed as $n \to \infty$, the HAC-based $t$-statistic converges to a non-standard limit distribution that depends on the kernel and on $b$, rather than to a standard normal.
The asymptotic critical values of the $t$-statistic therefore have to be tabulated (or simulated) for each kernel and each value of $b$.
Example
Example for the Bartlett kernel: $k(x) = (1 - |x|) \, \mathbf{1}\{|x| \leq 1\}$.
Fixed G Asymptotics
[Bester, 2013]: “Cluster covariance estimators are routinely used with data that has a group structure with independence assumed across groups. Typically, inference is conducted in such settings under the assumption that there are a large number of these independent groups.”
“However, with enough weakly dependent data, we show that groups can be chosen by the researcher so that group-level averages are approximately independent. Intuitively, if groups are large enough and well shaped (e.g. do not have gaps), the majority of points in a group will be far from other groups, and hence approximately independent of observations from other groups provided the data are weakly dependent. The key prerequisite for our methods is the researcher’s ability to construct groups whose averages are approximately independent. As we show later, this often requires that the number of groups be kept relatively small, which is why our main results explicitly consider a fixed (small) number of groups.”
Assumption
Assumption Suppose you have data with a group (cluster) structure, and assume observations are independent across clusters.
The cluster variance estimator is very similar to the HAC estimator, since we have dependent cross-products here as well. However, here we do not consider the cross-cluster cross-products: we only allow for dependence within a cluster (e.g., over time within a state).
Comments (1)
On the number of clusters:
- If the cluster sizes are fixed and $n \to \infty$ (so that the number of clusters grows), then the number of cross-products considered is much smaller than the total number of cross-products.
- If instead the number of clusters is very small relative to $n$, issues arise since the number of cross-products considered is close to the total number of cross-products. As in HAC estimation, this is a problem because it implies that the algebraic estimate of the cluster score variance gets close to zero, because of the orthogonality property of the residuals.
- The panel assumption is that observations across individuals are not correlated.
Strategy: as in HAC, we want to limit the correlation across clusters (individuals). We hope that observations are negligibly dependent between clusters that are sufficiently distant from each other.
Comments (2)
Classical cluster robust estimator:
$$\hat{V}_{cluster} = (X'X)^{-1} \Big( \sum_{g=1}^{G} X_g' \hat{\varepsilon}_g \hat{\varepsilon}_g' X_g \Big) (X'X)^{-1},$$
where $X_g$ and $\hat{\varepsilon}_g$ collect the regressors and residuals of the observations in cluster $g$.
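A minimal Julia sketch of this estimator (not part of the original code; the function name and the vector g of cluster labels are illustrative assumptions):

# Cluster-robust variance: sum of outer products of within-cluster score sums
using LinearAlgebra
function cluster_vcov(X, ε_hat, g)
    k = size(X, 2);
    Ω = zeros(k, k);
    for c in unique(g)
        idx = findall(==(c), g);          # observations in cluster c
        s_c = X[idx, :]' * ε_hat[idx];    # cluster score sum Σ_{i in c} x_i ε̂_i
        Ω += s_c * s_c';
    end
    return inv(X'*X) * Ω * inv(X'*X)      # sandwich form
end
# Example usage with the data generated below, under a hypothetical grouping into 10 clusters:
# g = repeat(1:10, inner = 10);
# V_cluster = cluster_vcov(X, ε_hat, g);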
On clusters:
- If the number of observations near a boundary is small relative to the sample size, ignoring the dependence should not affect inference too adversely.
- The higher the dimension of the data, the easier it is to have observations near boundaries (curse of dimensionality).
- We would like to have few clusters in order to make fewer independence assumptions. However, few clusters means bigger blocks and hence a larger number of cross-products to estimate. If the number of cross-products is too large (relative to the sample size), the cluster variance estimator does not converge.
Theorem: Under regularity conditions, with a fixed number of clusters $G$, the cluster-robust $t$-statistic converges in distribution to a (rescaled) Student-$t$ distribution with $G - 1$ degrees of freedom.
Code - DGP
This code draws 100 observations from the model $y_i = x_i'\beta + \varepsilon_i$, with $\beta = (2, -1)'$, $x_i \sim U[0,1]^2$ and $\varepsilon_i \sim N(0, \sigma^2)$, $\sigma^2 = 1$.
# Load the required packages
using Random, Distributions, LinearAlgebra, Statistics
# Set seed
Random.seed!(123);
# Set the number of observations
n = 100;
# Set the dimension of X
k = 2;
# Draw a sample of explanatory variables
X = rand(Uniform(0,1), n, k);
# Set the error variance
σ = 1;
# Draw the error term
ε = rand(Normal(0,1), n, 1) * sqrt(σ);
# Set the parameters
β = [2; -1];
# Calculate the dependent variable
y = X*β + ε;
Ideal Estimate
# OLS estimator
β_hat = (X'*X)\(X'*y);
# Residuals
ε_hat = y - X*β_hat;
# Homoskedastic standard errors
std_h = sqrt.(var(ε_hat) .* diag(inv(X'*X)));
# Projection matrix
P = X * inv(X'*X) * X';
# Leverage
h = diag(P);
HC Estimates
# HC0 variance and standard errors
Ω_hc0 = X' * (I(n) .* ε_hat.^2) * X;
std_hc0 = sqrt.(diag(inv(X'*X) * Ω_hc0 * inv(X'*X)))
## 2-element Array{Float64,1}:
## 0.24691300271914793
## 0.28044707935951835
# HC1 variance and standard errors
Ω_hc1 = n/(n-k) * X' * (I(n) .* ε_hat.^2) * X;
std_hc1 = sqrt.(diag(inv(X'*X) * Ω_hc1 * inv(X'*X)))
## 2-element Array{Float64,1}:
## 0.24941979797977423
## 0.2832943308272532
# HC2 variance and standard errors
Ω_hc2 = X' * (I(n) .* ε_hat.^2 ./ (1 .- h)) * X;
std_hc2 = sqrt.(diag(inv(X'*X) * Ω_hc2 * inv(X'*X)))
## 2-element Array{Float64,1}:
## 0.2506509902982869
## 0.2850878737103963
# HC3 variance and standard errors
Ω_hc3 = X' * (I(n) .* ε_hat.^2 ./ (1 .- h).^2) * X;
std_hc3 = sqrt.(diag(inv(X'*X) * Ω_hc3 * inv(X'*X)))
## 2-element Array{Float64,1}:
## 0.25446321015850176
## 0.2898264779289438
# Note what happens if you allow for full autocorrelation:
# this equals (X'ε̂)(X'ε̂)' ≈ 0 by the orthogonality property of the OLS residuals
omega_full = X'*ε_hat*ε_hat'*X;
Inference
Hypothesis Testing
In order to do inference on $\beta$, we need to know the sampling distribution of $\hat{\beta}_{OLS}$.
A statistical hypothesis is a subset of a statistical model.
Generally, we are interested in understanding whether it is likely that the data are drawn from a distribution in that subset or not.
A hypothesis test is a rule that, given the data, decides whether or not to reject the hypothesis.
Let $H_0$ denote the null hypothesis and $H_1$ the alternative hypothesis.
- Suppose $H_0$ is true. A Type I error (relative to $H_0$) is the event of rejecting $H_0$, an error that can only occur under $H_0$.
- Suppose $H_1$ is true. A Type II error (relative to $H_1$) is the event of failing to reject $H_0$, an error that can only occur under $H_1$.
The probability of a Type I error is called the size of the test. One minus the probability of a Type II error is called the power of the test (against the alternative $H_1$).
In this section, we are interested in testing three hypotheses, under the assumptions of linearity, strict exogeneity, no multicollinearity and normality of the error term. They are:
- $H_0: \beta_k = \bar{\beta}_k$ (single coefficient, $\bar{\beta}_k \in \mathbb{R}$);
- $H_0: a'\beta = c$ (linear combination, $a \in \mathbb{R}^k$, $c \in \mathbb{R}$);
- $H_0: R\beta = c$ (linear restrictions, $R$ a $q \times k$ matrix with full rank $q$, $c \in \mathbb{R}^q$).
Testing Problem
Consider the testing problem $H_0: \beta_k = \bar{\beta}_k$ against $H_1: \beta_k \neq \bar{\beta}_k$.
Theorem: In the testing procedure above, the sampling distribution under the null $H_0$ of the $t$-statistic $t = \dfrac{\hat{\beta}_k - \bar{\beta}_k}{se(\hat{\beta}_k)}$ is a Student-$t$ distribution with $n - k$ degrees of freedom (in the normal regression model).
Example
We want to assess whether or not the "true" coefficient $\beta_k$ is equal to zero.
- Null Hypothesis: $H_0: \beta_k = 0$.
- Alternative Hypothesis: $H_1: \beta_k \neq 0$.
Hence, we are interested in a statistic informative about $H_0$: the $t$-statistic $t = \hat{\beta}_k / se(\hat{\beta}_k)$.
However, the true variance of $\hat{\beta}_k$ is unknown, so the standard error $se(\hat{\beta}_k)$ is computed from a consistent variance estimate.
Comments
Hypothesis testing is like proof by contradiction. Imagine the sampling distribution was generated under $H_0$.
Then, given a realized value of the statistic $t$:
- Do not reject $H_0$: the realized value is consistent with random variation under a true $H_0$, i.e., $|t|$ is small relative to its reference distribution, which in the normal regression model is an exact Student-$t$ distribution with $n - k$ degrees of freedom.
- Reject $H_0$ in favor of $H_1$: $|t| > c$, with $c$ the critical value selected to control the probability of false rejection, $\Pr(|t| > c \mid H_0) = \alpha$. Equivalently, you can also reject if the p-value satisfies $p < \alpha$.
Comments (2)
The probability of false rejection is decreasing in the critical value $c$.
Example: Consider the testing problem $H_0: a'\beta = c$ against $H_1: a'\beta \neq c$, for a known vector $a$ and scalar $c$.
t Stat
Theorem
In the testing procedure above, the sampling distribution under the null $H_0$ of the $t$-statistic $t = \dfrac{a'\hat{\beta} - c}{se(a'\hat{\beta})}$ is a Student-$t$ distribution with $n - k$ degrees of freedom (in the normal regression model).
Like in the previous test, the unknown variance of $a'\hat{\beta}$ is replaced by an estimate when computing the standard error.
F Stat
Example
Consider the testing problem $H_0: R\beta = c$ against $H_1: R\beta \neq c$, where $R$ is a $q \times k$ matrix with full rank $q$.
The F-statistic for this problem is given by
$$F = \frac{(R\hat{\beta} - c)' \big[ R \hat{V}_{\hat{\beta}} R' \big]^{-1} (R\hat{\beta} - c)}{q} .$$
Theorem
For this problem, the sampling distribution of the F-statistic under the null $H_0$ is an $F$ distribution with $(q, n - k)$ degrees of freedom (in the normal regression model).
The test is intrinsically two-sided. The above sampling distribution can be used to construct a confidence interval.
Equivalence
Theorem
Consider the testing problem $H_0: R\beta = c$ against $H_1: R\beta \neq c$.
Consider the restricted least squares estimator, denoted $\tilde{\beta}$, which minimizes the sum of squared residuals subject to $R\tilde{\beta} = c$, and let $SSR_r = (y - X\tilde{\beta})'(y - X\tilde{\beta})$ and $SSR_u = (y - X\hat{\beta})'(y - X\hat{\beta})$ denote the restricted and unrestricted sums of squared residuals. Then the F-statistic can be equivalently written as
$$F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k)} .$$
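A quick numerical check of this equivalence for the joint hypothesis $H_0: \beta = 0$ (a sketch, not part of the original notes), reusing X, y, β_hat, ε_hat, n and k from the code sections above together with the homoskedastic variance estimate:

# Check: the Wald form of the F statistic equals the SSR-based form for H0: β = 0
ssr_u = (ε_hat'*ε_hat)[1];                          # unrestricted SSR
ssr_r = (y'*y)[1];                                  # restricted SSR: all coefficients set to zero
F_ssr = ((ssr_r - ssr_u)/k) / (ssr_u/(n - k));
F_wald = (β_hat'*(X'*X)*β_hat)[1] / (k * ssr_u/(n - k));
println(F_ssr ≈ F_wald)                             # prints true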
Confidence Intervals
A confidence interval at level $1 - \alpha$ for $\beta_k$ is a data-dependent interval $C$ such that $\Pr(\beta_k \in C) = 1 - \alpha$.
Since standard confidence intervals take the form $C = \big[ \hat{\beta}_k - c \cdot se(\hat{\beta}_k), \, \hat{\beta}_k + c \cdot se(\hat{\beta}_k) \big]$:
- they are symmetric around $\hat{\beta}_k$;
- their length is proportional to $se(\hat{\beta}_k)$.
A CI is equivalent to the set of parameter values such that the t-statistic is less than the critical value $c$ in absolute value, i.e., $C = \{ \beta : |t(\beta)| \leq c \}$.
In practice, to construct a 95% confidence interval for a single coefficient estimate $\hat{\beta}_k$, one takes $\hat{\beta}_k \pm 1.96 \cdot se(\hat{\beta}_k)$, using the normal critical value.
Code
# t-test for beta=0
t = abs.(β_hat ./ (std_hc1));
# Two-sided p-value (normal approximation)
p_val = 2 .* (1 .- cdf.(Normal(0,1), t));
# F statistic of joint significance
SSR_u = ε_hat'*ε_hat;
SSR_r = y'*y;
F = (SSR_r - SSR_u)/k / (SSR_u/(n-k));
# 95% confidence intervals
conf_int = [β_hat - 1.96*std_hc1, β_hat + 1.96*std_hc1];