2021-10-29
A statistical model is a set of probability distributions \(\lbrace P \rbrace\).
More precisely, a statistical model for data \(D\) taking values in \(\mathcal{D}\) is a set of probability distributions over \(\mathcal{D}\).
Suppose you have regression data \(\lbrace x_i , y_i \rbrace _ {i=1}^N\) with \(x_i \in \mathbb{R}^p\) and \(y_i \in \mathbb{R}\). The statistical model is
\[ \Big\lbrace P : y_i = f(x_i) + \varepsilon_i, \ x_i \sim F_x , \ \varepsilon_i \sim F _\varepsilon , \ \varepsilon_i \perp x_i , \ f \in C^2 (\mathbb{R}^p) \Big\rbrace \]
In words: the statistical model is the set of distributions \(P\) under which \(y_i\) admits an additive decomposition \(f(x_i) + \varepsilon_i\), where the noise \(\varepsilon_i\) is independent of \(x_i\) and \(f\) is twice continuously differentiable.
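To make this concrete, here is a sketch of a single member \(P\) of this model; the particular choices of \(f\), \(F_x\), and \(F_\varepsilon\) below are illustrative assumptions, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n=500, p=2):
    """Draw one dataset from a single P in the statistical model."""
    x = rng.normal(size=(n, p))          # x_i ~ F_x (here: standard normal)
    f = np.sin(x[:, 0]) + x[:, 1] ** 2   # f in C^2(R^p) (illustrative choice)
    eps = rng.normal(scale=0.5, size=n)  # eps_i ~ F_eps, independent of x_i
    y = f + eps                          # y_i = f(x_i) + eps_i
    return x, y

x, y = draw_dataset()
```

Any other choice of \(f \in C^2\) and of the distributions \(F_x, F_\varepsilon\) yields another member of the same model.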
A data generating process (DGP) is a single probability distribution over the data, i.e. the distribution from which the observed dataset \(D\) was actually drawn.
A statistical model parameterized by \(\theta \in \Theta\) is well specified if the data generating process corresponds to some \(\theta_0\) and \(\theta_0 \in \Theta\). Otherwise, the statistical model is misspecified.
A statistical model can be parametrized as \(\mathcal{F} = \lbrace P_\theta \rbrace _ {\theta \in \Theta}\).
We can divide statistical models into 3 classes: parametric, semiparametric, and nonparametric.
Let \(\mathcal{D}\) be the set of possible data realizations. Let \(D \in \mathcal{D}\) be your data. Let \(\mathcal{F}\) be a statistical model indexed by some parameter \(\theta \in \Theta\). An estimator is a map \[ \mathcal{D} \to \mathcal{F} \quad , \quad D \mapsto \hat{\theta} \]
In words:
- An estimator is a map from the set of data realizations into the statistical model.
- It takes as inputs a dataset \(D\) and outputs a parameter estimate \(\hat \theta\).
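As a sketch, the OLS estimator in the linear special case \(y_i = x_i' \beta + \varepsilon_i\) is exactly such a map from a dataset to a parameter estimate (the data below are simulated for illustration):

```python
import numpy as np

def ols_estimator(X, y):
    """The map D = (X, y) -> beta_hat: OLS via least squares."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Simulated dataset with known true parameter (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

beta_hat = ols_estimator(X, y)  # a point estimate in the parameter space
```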
Let \(\alpha > 0\) be a small tolerance. Statistical inference is a map from datasets into subsets of \(\mathcal{F}\) given by \[ \mathcal{D} \to \mathcal{G} \subseteq \mathcal{F} : \min _ {\theta : P_\theta \in \mathcal{G}} P_\theta (D) \geq 1-\alpha \]
In words:
- Inference maps datasets into sets of models
- The set contains only models that generate the observed data with high probability
- I.e. at least \(1-\alpha\)
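A minimal sketch in the simplest case: a \(1-\alpha\) confidence interval for the mean of a normal distribution with known variance. The interval collects the parameter values whose implied distribution makes the observed data plausible (the critical value below is hard-coded for \(\alpha = 0.05\); all names are illustrative):

```python
import numpy as np

def confidence_set(data, sigma, z=1.959963985):
    """Map a dataset to a subset of the model: a 95% confidence
    interval for a normal mean with known sigma.

    The constant z is the standard normal quantile at 1 - alpha/2
    for alpha = 0.05.
    """
    half = z * sigma / np.sqrt(len(data))
    m = data.mean()
    return (m - half, m + half)

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=400)
lo, hi = confidence_set(data, sigma=2.0)
```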
A statistical hypothesis \(H_0\) is a subset of a statistical model, \(\mathcal K \subset \mathcal F\).
If \(\mathcal F\) is the statistical model and \(\mathcal K\) is the statistical hypothesis, we use the notation \(H_0 : P \in \mathcal K\).
Example
Common hypotheses are
- A single coefficient being equal to a value: \(\beta_k = c \in \mathbb R\)
- Multiple linear combinations of coefficients being equal to some values: \(\boldsymbol R' \beta = r \in \mathbb R^p\)
A hypothesis test \(T\) is a map from the space of datasets to a decision: fail to reject (0) or reject (1) \[ \mathcal D \to \lbrace 0, 1 \rbrace \quad, \quad D \mapsto T(D) \]
Generally, we are interested in whether the data \(D\) are likely to have been drawn from a model in \(\mathcal K\) or not.
A hypothesis test \(T\) is our tool for deciding whether the hypothesis is consistent with the data.
- \(T(D) = 0 \to\) fail to reject \(H_0\): the test is inconclusive
- \(T(D) = 1 \to\) reject \(H_0\): \(D\) is inconsistent with every \(P \in \mathcal K\)
Let \(\mathcal K \subset \mathcal F\) be a statistical hypothesis and \(T\) a hypothesis test.
The probability of a type I error, i.e. of rejecting \(H_0\) when it is true, is called the size of the test.
The probability of rejecting \(H_0\) when a specific alternative \(P \in \mathcal K^C\) is true is called the power of the test against \(P\); one minus the power is the probability of a type II error.
Test size is the probability of a Type I error, i.e. \[ \Pr \Big[ \text{ Reject } H_0 \Big| H_0 \text{ is true } \Big] = \Pr \Big[ T(D)=1 \Big| P \in \mathcal K \Big] \] A primary goal of test construction is to limit the incidence of Type I error by bounding the size of the test.
In the dominant approach to hypothesis testing the researcher pre-selects a significance level \(\alpha \in (0,1)\) and then selects the test so that its size is no larger than \(\alpha\).
The probability of a Type II error is \[ \Pr \Big[ \text{ Not Reject } H_0 \Big| H_0 \text{ is false } \Big] = \Pr \Big[ T(D)=0 \Big| P \in \mathcal K^C \Big] \] Test power is one minus this probability: the probability of rejecting \(H_0\) when it is false.
In the dominant approach to hypothesis testing the goal of test construction is to have high power subject to the constraint that the size of the test is lower than the pre-specified significance level.
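Size and power can be checked by Monte Carlo simulation. A sketch, assuming a two-sided 5%-level z-test of \(H_0 : \mu = 0\) for a normal mean with known unit variance:

```python
import numpy as np

rng = np.random.default_rng(4)

def reject(data):
    """5%-level two-sided z-test of H0: mu = 0 with sigma = 1."""
    z = data.mean() * np.sqrt(len(data))
    return abs(z) > 1.959963985

n, reps = 100, 2000
# Size: rejection frequency when H0 is true (mu = 0)
size = np.mean([reject(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
# Power: rejection frequency under the alternative mu = 0.3
power = np.mean([reject(rng.normal(0.3, 1.0, n)) for _ in range(reps)])
```

The simulated size should be close to the nominal 5%, while the power against \(\mu = 0.3\) is substantially higher.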
TBD
We now summarize the main features of hypothesis testing.
Let’s focus on two hypotheses:
Consider the testing problem \(H_0 : \beta_k = c\), where \(c\) is a pre-specified value under the null. Suppose the variance of the estimator \(\hat \beta_k\) is known.
The t-statistic for this problem is defined by \[ n_{k}:=\frac{\hat \beta_{k} - c}{\sigma_{\hat \beta_{k}}} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ n_k \sim N(0,1) \] where \(N(0,1)\) denotes the standard normal distribution.
Consider the testing problem \(H_0 : \beta_k = c\), where \(c\) is a pre-specified value under the null. In case the variance of the estimator \(\hat \beta_k\) is not known, we have to replace it with a consistent estimate \(\hat \sigma^2_{\hat \beta_k}\).
The t-statistic for this problem is defined by \[ t_{k}:=\frac{\hat \beta_{k} - c}{\hat \sigma_{\hat \beta_{k}}} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ t_k \sim t_{n-K} \] where \(t_{n-K}\) denotes the t-distribution with \(n-K\) degrees of freedom.
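A sketch of this t-statistic for a regression coefficient, using the homoskedastic variance estimate \(\hat\sigma^2_{\hat \beta_k} = s^2 [(X'X)^{-1}]_{kk}\) (the simulated data and all names are illustrative):

```python
import numpy as np

def t_stat(X, y, k, c=0.0):
    """t-statistic for H0: beta_k = c with estimated variance."""
    n, K = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - K)            # estimate of sigma^2
    XtX_inv = np.linalg.inv(X.T @ X)
    se_k = np.sqrt(s2 * XtX_inv[k, k])      # standard error of beta_hat_k
    return (beta_hat[k] - c) / se_k         # ~ t_{n-K} under H0

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.0, 1.0, -1.0]) + rng.normal(size=100)
t0 = t_stat(X, y, k=0)  # H0: beta_0 = 0 is true, |t0| typically small
t1 = t_stat(X, y, k=1)  # H0: beta_1 = 0 is false, |t1| is large
```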
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. Suppose the variance of the estimator \(\hat \beta\) is known.
The Wald statistic for this problem is given by \[ W := (\boldsymbol R' \hat \beta-r)^{\prime} \big[ \boldsymbol R' \sigma^{2} _ {\hat \beta} \boldsymbol R \big]^{-1} (\boldsymbol R' \hat \beta-r) \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ W \sim \chi^2_{p} \] where \(\chi^2_{p}\) denotes the chi-squared distribution with \(p\) degrees of freedom, one per restriction.
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. In case the variance of the estimator \(\hat \beta\) is not known, we have to replace it with a consistent estimate \(\hat \sigma^2_{\hat \beta}\).
The F-statistic for this problem is given by \[ F := \frac{(\boldsymbol R' \hat \beta-r)^{\prime} \big[ \boldsymbol R' \hat \sigma^{2} _ {\hat \beta} \boldsymbol R \big]^{-1} (\boldsymbol R' \hat \beta-r)}{p} \] In the testing procedure above, the sampling distribution under the null \(H_0\) is given by \[ F \sim F_{p, n-K} \] where \(F_{p, n-K}\) denotes the F-distribution with \(p\) and \(n-K\) degrees of freedom.
Consider the testing problem \(H_0 : \boldsymbol R' \beta = r\), where \(\boldsymbol R \in \mathbb R^{K \times p}\) is a pre-specified matrix of linear combinations and \(r \in \mathbb R^p\) is a restriction vector. Consider two estimators: the unrestricted estimator \(\hat \beta_U\) and the restricted estimator \(\hat \beta_R\) obtained by imposing \(\boldsymbol R' \beta = r\).
Then the F statistic is numerically equivalent to the following expression \[ F = \frac{\left(SSR_{R}-SSR_{U}\right) / p}{SSR_{U} /(n-K)} \] where \(SSR_R\) and \(SSR_U\) denote the restricted and unrestricted sums of squared residuals.
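This equivalence can be verified numerically. A sketch testing \(H_0 : \beta_1 = \beta_2 = 0\) in a regression with an intercept, computing \(F\) both via the Wald form (with the homoskedastic variance estimate) and via the two sums of squared residuals (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

# Unrestricted OLS
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_u = y - X @ beta_hat
ssr_u = resid_u @ resid_u

# Restricted OLS under beta_1 = beta_2 = 0: intercept-only model
resid_r = y - y.mean()
ssr_r = resid_r @ resid_r

p = 2  # number of restrictions
F_ssr = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))

# Wald form with the homoskedastic variance estimate s^2 (X'X)^{-1}
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])          # restrictions pick beta_1, beta_2
s2 = ssr_u / (n - K)
V = s2 * np.linalg.inv(X.T @ X)          # estimated variance of beta_hat
diff = R @ beta_hat                      # restriction discrepancy (r = 0)
F_wald = diff @ np.linalg.solve(R @ V @ R.T, diff) / p
```

Both computations agree to floating-point precision.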
TBD
TBD
Consider a sequence of well-specified data generating processes \(\mathcal F_n\), each indexed by the same parameter space \(\Theta\), with \(\theta_0\) a component of the true parameter for each \(n\).
Then estimator \(\hat \theta\) is
The asymptotic size of a testing procedure is defined as the limiting probability of rejecting \(H_0\) when \(H_0\) is true. Mathematically, we can write this as \(\lim _ {n \to \infty} \Pr_n ( \text{reject } H_0 | H_0)\), where the \(n\) subscript indexes the sample size.
A test is said to be consistent against the alternative \(H_1\) if the null hypothesis is rejected with probability approaching \(1\) when \(H_1\) is true: \(\lim _ {n \to \infty} \Pr_n (\text{reject } H_0 | H_1) = 1\).
Theorem: Suppose that \(\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V)\), where \(V\) is positive definite. Then for any non-stochastic \(Q\times P\) matrix \(R\), \(Q \leq P\), with rank\(( R ) = Q\) \[ \sqrt{n} R (\hat{\theta} - \theta_0) \overset{d}{\to} N(0, R VR') \] and \[ [\sqrt{n}R(\hat{\theta} - \theta_0)]'[RVR']^{-1}[\sqrt{n}R(\hat{\theta} - \theta_0)] \overset{d}{\to} \chi^2_Q \] In addition, if \(\text{plim} \hat{V} _n = V\), then \[ (\hat{\theta} - \theta_0)' R'[R (\hat{V} _n/n) R']^{-1}R (\hat{\theta} - \theta_0) \overset{d}{\to} \chi^2_Q \]
For testing the null hypothesis \(H_0: R\theta_0 = r\), where \(r\) is a \(Q\times1\) non-random vector, define the Wald statistic for testing \(H_0\) against \(H_1 : R\theta_0 \neq r\) as \[ W_n = (R\hat{\theta} - r)'[R (\hat{V} _n/n) R']^{-1} (R\hat{\theta} - r) \] Under \(H_0\), \(W_n \overset{d}{\to} \chi^2_Q\). If we abuse the asymptotics and treat \(\hat{\theta}\) as exactly normally distributed, the \(\chi^2_Q\) distribution of \(W_n\) holds exactly.
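A sketch of \(W_n\) for the simplest estimator, a bivariate sample mean, with \(\hat V_n\) the sample covariance matrix; the restriction and data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
data = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(n, 2))

theta_hat = data.mean(axis=0)       # sqrt(n)(theta_hat - theta_0) -> N(0, V)
V_hat = np.cov(data, rowvar=False)  # consistent estimate of V

R = np.array([[1.0, -1.0]])         # Q = 1 restriction: theta_1 - theta_2
r = np.array([-1.0])                # H0: theta_1 - theta_2 = -1 (true here)

diff = R @ theta_hat - r
W = diff @ np.linalg.solve(R @ (V_hat / n) @ R.T, diff)
# Under H0, W is approximately chi-squared with Q = 1 degree of freedom
```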