2021-10-29
A statistical model is a set of probability distributions \(\lbrace P \rbrace\).
More precisely, a statistical model for data \(D\) taking values in \(\mathcal{D}\) is a set of probability distributions over \(\mathcal{D}\).
Suppose you have regression data \(\lbrace x_i , y_i \rbrace _ {i=1}^N\) with \(x_i \in \mathbb{R}^p\) and \(y_i \in \mathbb{R}\). The statistical model is
\[ \Big\lbrace P : y_i = f(x_i) + \varepsilon_i, \ x_i \sim F_x , \ \varepsilon_i \sim F _\varepsilon , \ \varepsilon_i \perp x_i , \ f \in C^2 (\mathbb{R}^p) \Big\rbrace \]
In words: the statistical model is the set of distributions \(P\) under which \(y_i\) admits an additive decomposition \(f(x_i) + \varepsilon_i\), where the noise \(\varepsilon_i\) is independent of \(x_i\) and \(f\) is twice continuously differentiable.
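To make this concrete, here is a sketch of a single member \(P\) of this model; the particular choices of \(f\), \(F_x\), and \(F_\varepsilon\) below are illustrative assumptions, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n=500, p=2):
    """Draw one dataset from a single P in the statistical model."""
    x = rng.normal(size=(n, p))          # x_i ~ F_x (here: standard normal)
    f = np.sin(x[:, 0]) + x[:, 1] ** 2   # f in C^2(R^p) (illustrative choice)
    eps = rng.normal(scale=0.5, size=n)  # eps_i ~ F_eps, independent of x_i
    y = f + eps                          # y_i = f(x_i) + eps_i
    return x, y

x, y = draw_dataset()
```

Any other choice of \(f \in C^2\) and of the distributions \(F_x, F_\varepsilon\) yields another member of the same model.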
A data generating process (DGP) is a single probability distribution over the data, i.e. the distribution from which the observed dataset \(D\) was actually drawn.
A statistical model parameterized by \(\theta \in \Theta\) is well specified if the data generating process corresponds to some \(\theta_0\) and \(\theta_0 \in \Theta\). Otherwise, the statistical model is misspecified.
A statistical model can be parametrized as \(\mathcal{F} = \lbrace P_\theta \rbrace _ {\theta \in \Theta}\).
We can divide statistical models into 3 classes: parametric, semiparametric, and nonparametric.
Let \(\mathcal{D}\) be the set of possible data realizations. Let \(D \in \mathcal{D}\) be your data. Let \(\mathcal{F}\) be a statistical model indexed by some parameter \(\theta \in \Theta\). An estimator is a map \[ \mathcal{D} \to \mathcal{F} \quad , \quad D \mapsto \hat{\theta} \]
In words:
- An estimator is a map from the set of data realizations into the statistical model.
- It takes as inputs a dataset \(D\) and outputs a parameter estimate \(\hat \theta\).
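As a sketch, the OLS estimator in the linear special case \(y_i = x_i' \beta + \varepsilon_i\) is exactly such a map from a dataset to a parameter estimate (the data below are simulated for illustration):

```python
import numpy as np

def ols_estimator(X, y):
    """The map D = (X, y) -> beta_hat: OLS via least squares."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Simulated dataset with known true parameter (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

beta_hat = ols_estimator(X, y)  # a point estimate in the parameter space
```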
Let \(\alpha > 0\) be a small tolerance. Statistical inference is a map from datasets into subsets of \(\mathcal{F}\) given by \[ \mathcal{D} \to \mathcal{G} \subseteq \mathcal{F} : \min _ {\theta : P_\theta \in \mathcal{G}} P_\theta (D) \geq 1-\alpha \]
In words:
- Inference maps datasets into sets of models
- The set contains only models that generate the observed data with high probability
- I.e. at least \(1-\alpha\)
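A minimal sketch in the simplest case: a \(1-\alpha\) confidence interval for the mean of a normal distribution with known variance. The interval collects the parameter values whose implied distribution makes the observed data plausible (the critical value below is hard-coded for \(\alpha = 0.05\); all names are illustrative):

```python
import numpy as np

def confidence_set(data, sigma, z=1.959963985):
    """Map a dataset to a subset of the model: a 95% confidence
    interval for a normal mean with known sigma.

    The constant z is the standard normal quantile at 1 - alpha/2
    for alpha = 0.05.
    """
    half = z * sigma / np.sqrt(len(data))
    m = data.mean()
    return (m - half, m + half)

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=400)
lo, hi = confidence_set(data, sigma=2.0)
```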
A statistical hypothesis \(H_0\) is a subset of a statistical model, \(\mathcal K \subset \mathcal F\).
If \(\mathcal F\) is the statistical model and \(\mathcal K\) is the statistical hypothesis, we use the notation \(H_0 : P \in \mathcal K\).
Example
Common hypotheses are
- A single coefficient being equal to a value: \(\beta_k = c \in \mathbb R\)
- Multiple linear combinations of coefficients being equal to some values: \(\boldsymbol R' \beta = r \in \mathbb R^p\)
A hypothesis test \(T\) is a map from the space of datasets to a decision: fail to reject (0) or reject (1) \[ \mathcal D \to \lbrace 0, 1 \rbrace \quad, \quad D \mapsto T(D) \]
Generally, we are interested in whether the data \(D\) are likely to have been drawn from a model in \(\mathcal K\) or not.
A hypothesis test \(T\) is our tool for deciding whether the hypothesis is consistent with the data.
- \(T(D) = 0 \to\) fail to reject \(H_0\): the test is inconclusive
- \(T(D) = 1 \to\) reject \(H_0\): \(D\) is inconsistent with every \(P \in \mathcal K\)
Let \(\mathcal K \subset \mathcal F\) be a statistical hypothesis and \(T\) a hypothesis test.
The probability of a type I error, i.e. of rejecting \(H_0\) when it is true, is called the size of the test.
The probability of rejecting \(H_0\) when a specific alternative \(P \in \mathcal K^C\) is true is called the power of the test against \(P\); one minus the power is the probability of a type II error.
Test size is the probability of a Type I error, i.e. \[ \Pr \Big[ \text{ Reject } H_0 \Big| H_0 \text{ is true } \Big] = \Pr \Big[ T(D)=1 \Big| P \in \mathcal K \Big] \] A primary goal of test construction is to limit the incidence of Type I error by bounding the size of the test.
In the dominant approach to hypothesis testing the researcher pre-selects a significance level \(\alpha \in (0,1)\) and then selects the test so that its size is no larger than \(\alpha\).
The probability of a Type II error is \[ \Pr \Big[ \text{ Not Reject } H_0 \Big| H_0 \text{ is false } \Big] = \Pr \Big[ T(D)=0 \Big| P \in \mathcal K^C \Big] \] Test power is one minus this probability: the probability of rejecting \(H_0\) when it is false.
In the dominant approach to hypothesis testing the goal of test construction is to have high power subject to the constraint that the size of the test is lower than the pre-specified significance level.
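Size and power can be checked by Monte Carlo simulation. A sketch, assuming a two-sided 5%-level z-test of \(H_0 : \mu = 0\) for a normal mean with known unit variance:

```python
import numpy as np

rng = np.random.default_rng(4)

def reject(data):
    """5%-level two-sided z-test of H0: mu = 0 with sigma = 1."""
    z = data.mean() * np.sqrt(len(data))
    return abs(z) > 1.959963985

n, reps = 100, 2000
# Size: rejection frequency when H0 is true (mu = 0)
size = np.mean([reject(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
# Power: rejection frequency under the alternative mu = 0.3
power = np.mean([reject(rng.normal(0.3, 1.0, n)) for _ in range(reps)])
```

The simulated size should be close to the nominal 5%, while the power against \(\mu = 0.3\) is substantially higher.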
TBD
We now summarize the main features of hypothesis testing.
Let’s focus on two hypotheses:
Consider the testing problem \(H_0 : \beta_k = c\), where \(c\) is a pre-specified value under the null. Suppose the variance of the estimator \(\hat \beta_k\) is known.
The t-statistic for this problem is defined by \[ n_{k}:=\frac{\hat \beta_{k} - c}{\sigma_{\hat \beta_{k}}} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ n_k \sim N(0,1) \] where \(N(0,1)\) denotes the standard normal distribution.
Consider the testing problem \(H_0 : \beta_k = c\), where \(c\) is a pre-specified value under the null. In case the variance of the estimator \(\hat \beta_k\) is not known, we have to replace it with a consistent estimate \(\hat \sigma^2_{\hat \beta_k}\).
The t-statistic for this problem is defined by \[ t_{k}:=\frac{\hat \beta_{k} - c}{\hat \sigma_{\hat \beta_{k}}} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ t_k \sim t_{n-K} \] where \(t_{n-K}\) denotes the t-distribution with \(n-K\) degrees of freedom.
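A sketch of this t-statistic for a regression coefficient, using the homoskedastic variance estimate \(\hat\sigma^2_{\hat \beta_k} = s^2 [(X'X)^{-1}]_{kk}\) (the simulated data and all names are illustrative):

```python
import numpy as np

def t_stat(X, y, k, c=0.0):
    """t-statistic for H0: beta_k = c with estimated variance."""
    n, K = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - K)            # estimate of sigma^2
    XtX_inv = np.linalg.inv(X.T @ X)
    se_k = np.sqrt(s2 * XtX_inv[k, k])      # standard error of beta_hat_k
    return (beta_hat[k] - c) / se_k         # ~ t_{n-K} under H0

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.0, 1.0, -1.0]) + rng.normal(size=100)
t0 = t_stat(X, y, k=0)  # H0: beta_0 = 0 is true, |t0| typically small
t1 = t_stat(X, y, k=1)  # H0: beta_1 = 0 is false, |t1| is large
```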
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. Suppose the variance of the estimator \(\hat \beta\) is known.
The Wald statistic for this problem is given by \[ W := (\boldsymbol R' \hat \beta-r)^{\prime} \big[ \boldsymbol R' \sigma^{2} _ {\hat \beta} \boldsymbol R \big]^{-1} (\boldsymbol R' \hat \beta-r) \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ W \sim \chi^2_{p} \] where \(\chi^2_{p}\) denotes the chi-squared distribution with \(p\) degrees of freedom, one per restriction.
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. In case the variance of the estimator \(\hat \beta\) is not known, we have to replace it with a consistent estimate \(\hat \sigma^2_{\hat \beta}\).
The F-statistic for this problem is given by \[ F := \frac{(\boldsymbol R' \hat \beta-r)^{\prime} \big[ \boldsymbol R' \hat \sigma^{2} _ {\hat \beta} \boldsymbol R \big]^{-1} (\boldsymbol R' \hat \beta-r)}{p} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ F \sim F_{p, n-K} \] where \(F_{p, n-K}\) denotes the F-distribution with \(p\) and \(n-K\) degrees of freedom.
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. Consider two estimators: the unrestricted estimator \(\hat \beta_U\) and the restricted estimator \(\hat \beta_R\) obtained by imposing \(\boldsymbol R' \beta = r\).
Then the F statistic is numerically equivalent to the following expression \[ F = \frac{\left(SSR_{R}-SSR_{U}\right) / p}{SSR_{U} /(n-K)} \] where \(SSR_R\) and \(SSR_U\) denote the restricted and unrestricted sums of squared residuals.
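This equivalence can be verified numerically. A sketch testing \(H_0 : \beta_1 = \beta_2 = 0\) in a regression with an intercept, computing \(F\) both via the Wald form (with the homoskedastic variance estimate) and via the two sums of squared residuals (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

# Unrestricted OLS
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_u = y - X @ beta_hat
ssr_u = resid_u @ resid_u

# Restricted OLS under beta_1 = beta_2 = 0: intercept-only model
resid_r = y - y.mean()
ssr_r = resid_r @ resid_r

p = 2  # number of restrictions
F_ssr = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))

# Wald form with the homoskedastic variance estimate s^2 (X'X)^{-1}
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])          # restrictions pick beta_1, beta_2
s2 = ssr_u / (n - K)
V = s2 * np.linalg.inv(X.T @ X)          # estimated variance of beta_hat
diff = R @ beta_hat                      # restriction discrepancy (r = 0)
F_wald = diff @ np.linalg.solve(R @ V @ R.T, diff) / p
```

Both computations agree to floating-point precision.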
TBD
TBD
Consider a sequence of well-specified data generating processes \(\mathcal F_n\), each indexed by the same parameter space \(\Theta\), with \(\theta_0\) a component of the true parameter for each \(n\).
Then estimator \(\hat \theta\) is
The asymptotic size of a testing procedure is defined as the limiting probability of rejecting \(H_0\) when \(H_0\) is true. Mathematically, we can write this as \(\lim _ {n \to \infty} \Pr_n ( \text{reject } H_0 | H_0)\), where the \(n\) subscript indexes the sample size.
A test is said to be consistent against the alternative \(H_1\) if the null hypothesis is rejected with probability approaching \(1\) when \(H_1\) is true: \(\lim _ {n \to \infty} \Pr_n (\text{reject } H_0 | H_1) = 1\).
Theorem: Suppose that \(\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V)\), where \(V\) is positive definite. Then for any non-stochastic \(Q\times P\) matrix \(R\), \(Q \leq P\), with rank\(( R ) = Q\) \[ \sqrt{n} R (\hat{\theta} - \theta_0) \overset{d}{\to} N(0, R VR') \] and \[ [\sqrt{n}R(\hat{\theta} - \theta_0)]'[RVR']^{-1}[\sqrt{n}R(\hat{\theta} - \theta_0)] \overset{d}{\to} \chi^2_Q \] In addition, if \(\text{plim} \hat{V} _n = V\), then \[ (\hat{\theta} - \theta_0)' R'[R (\hat{V} _n/n) R']^{-1}R (\hat{\theta} - \theta_0) \overset{d}{\to} \chi^2_Q \]
For testing the null hypothesis \(H_0: R\theta_0 = r\), where \(r\) is a \(Q\times1\) non-random vector, define the Wald statistic for testing \(H_0\) against \(H_1 : R\theta_0 \neq r\) as \[ W_n = (R\hat{\theta} - r)'[R (\hat{V} _n/n) R']^{-1} (R\hat{\theta} - r) \] Under \(H_0\), \(W_n \overset{d}{\to} \chi^2_Q\). If we abuse the asymptotics and treat \(\hat{\theta}\) as exactly normally distributed, the \(\chi^2_Q\) distribution of \(W_n\) holds exactly.
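A sketch of \(W_n\) for the simplest estimator, a bivariate sample mean, with \(\hat V_n\) the sample covariance matrix; the restriction and data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
data = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(n, 2))

theta_hat = data.mean(axis=0)       # sqrt(n)(theta_hat - theta_0) -> N(0, V)
V_hat = np.cov(data, rowvar=False)  # consistent estimate of V

R = np.array([[1.0, -1.0]])         # Q = 1 restriction: theta_1 - theta_2
r = np.array([-1.0])                # H0: theta_1 - theta_2 = -1 (true here)

diff = R @ theta_hat - r
W = diff @ np.linalg.solve(R @ (V_hat / n) @ R.T, diff)
# Under H0, W is approximately chi-squared with Q = 1 degree of freedom
```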