2021-10-29
A probability space is a triple \((\Omega, \mathcal A, P)\) where
The sample space \(\Omega\) is the space of all possible events.
What is a \(\sigma\)-algebra and a probability measure?
A nonempty set (of subsets of \(\Omega\)) \(\mathcal A \in 2^\Omega\) is a sigma algebra (\(\sigma\)-algebra) of \(\Omega\) if the following conditions hold:
The smallest \(\sigma\)-algebra is \(\lbrace \emptyset, \Omega \rbrace\) and the largest one is \(2^\Omega\) (in cardinality terms).
Suppose \(\Omega = \mathbb R\). Let \(\mathcal{C} = \lbrace (a, b],-\infty \leq a<b<\infty \rbrace\). Then the Borel \(\sigma\)- algebra on \(\mathbb R\) is defined by \[ \mathcal B (\mathbb R) = \sigma (\mathcal C) \]
A probability measure \(P: \mathcal A \to [0,1]\) is a set function with domain \(\mathcal A\) and codomain \([0,1]\) such that
Some properties of probability measures
Let \(A, B \in \mathcal A\) and \(P(B) > 0\), the conditional probability of \(A\) given \(B\) is \[ P(A | B)=\frac{P(A \cap B)}{P(B)} \]
Two events \(A\) and \(B\) are independent if \(P(A \cap B)=P(A) P(B)\).
Theorem (Law of Total Probability)
Let \((E_n) _ {n \geq 1}\) be a finite or countable partition of \(\Omega\). Then, if \(A \in \mathcal A\), \[ P(A) = \sum_n P(A | E_n ) P(E_n) \]
Theorem (Bayes Theorem)
Let \((E_n) _ {n \geq 1}\) be a finite or countable partition of \(\Omega\), and suppose \(P(A) > 0\). Then, \[ P(E_n | A) = \frac{P(A | E_n) P(E_n)}{\sum_m P(A | E_m) P(E_m)} \]
For a single event \(E \in \Omega\), \[ P(E|A) = \frac{P(A|E) P(E)}{P(A)} \]
A random variable \(X\) on a probability space \((\Omega,\mathcal A, P)\) is a (measurable) mapping \(X : \Omega \to \mathbb{R}\) such that \[ \forall B \in \mathcal{B}(\mathbb{R}), \quad X^{-1}(B) \in \mathcal{A} \]
The measurability condition states that the inverse image is a measurable set of \(\Omega\) i.e. \(X^{-1}(B) \in \mathcal A\). This is essential since probabilities are defined only on \(\mathcal A\).
In words, a random variable it’s a mapping from events to real numbers such that each interval on the real line can be mapped back into an element of the sigma algebra (it can be the empty set).
Let \(X\) be a real valued random variable. The distribution function (also called cumulative distribution function) of \(X\), commonly denoted \(F_X(x)\) is defined by \[ F_X(x) = \Pr(X \leq x) \]
Properties
- \(F\) is monotone non-decreasing
- \(F\) is right continuous
- \(\lim _ {x \to - \infty} F(x)=0\) and \(\lim _ {x \to + \infty} F(x)=1\)
The random variables \((X_1, .. , X_n)\) are independent if and only if \[ F _ {(X_1, ... , X_n)} (x) = \prod _ {i=1}^n F_{X_i} (x_i) \quad \forall x \in \mathbb R^n \]
Let \(X\) be a real valued random variable. \(X\) has a probability density function if there exists \(f_X(x)\) such that for all measurable \(A \subset \mathbb{R}\), \[ P(X \in A) = \int_A f_X(x) \mathrm{d} x \]
The expected value of a random variable, when it exists, is given by \[ \mathbb{E}[ X ] = \int_ \Omega X(\omega) \mathrm{d} P \] When \(X\) has a density, then \[ \mathbb{E} [ X ] = \int_ \mathbb{R} x f_X (x) \mathrm{d} x = \int _ \mathbb{R} x \mathrm{d} F_X (x) \]
The empirical expectation (or sample average) is given by \[ \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^N x_i \]
The covariance of two random variables \(X\), \(Y\) defined on \(\Omega\) is \[ Cov(X, Y ) = \mathbb{E}[ (X - \mathbb{E}[ X ]) (Y - \mathbb{E}[ Y ]) ] = \mathbb{E}[XY ] - \mathbb{E}[ X ]E[ Y ] \] In vector notation, \(Cov(X, Y) = \mathbb{E}[XY'] - \mathbb{E}[ X ]\mathbb{E}[Y']\).
The variance of a random variable \(X\), when it exists, is given by \[ Var(X) = \mathbb{E}[ (X - \mathbb{E}[ X ])^2 ] = \mathbb{E}[X^2] - \mathbb{E}[ X ]^2 \] In vector notation, \(Var(X) = \mathbb{E}[XX'] - \mathbb{E}[ X ]\mathbb{E}[X']\).
Let \(X, Y, Z, T \in \mathcal{L}^{2}\) and \(a, b, c, d \in \mathbb{R}\)
Let \(X, Y \in \mathcal L^1\) be independent. Then, \(\mathbb E[XY] = \mathbb E[ X ] \mathbb E[ Y ]\).
If \(X\) and \(Y\) are independent, then \(Cov(X,Y) = 0\).
Note that the converse does not hold: \(Cov(X,Y) = 0 \not \to X \perp Y\).
The sample variance is given by \[ Var_n (x_i) = \frac{1}{n} \sum _ {i=1}^N (x_i - \bar{x})^2 \] where \(\bar{x_i} = \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^N x_i\).
Theorem: The expected sample variance \(\mathbb{E} [\sigma^2_n] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^N \left(y_i - \mathbb{E}_n[ Y ] \right)^2 \right]\) gives an estimate of the population variance that is biased by a factor of \(\frac{1}{n}\) and is therefore referred to as biased sample variance.
Proof: \[ \begin{aligned} &\mathbb{E}[\sigma^2_n] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \mathbb{E}_n [ Y ] \right)^2 \right] = \newline &= \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \frac{1}{n} \sum _ {i=1}^n y_i \right )^2 \right] = \newline &= \frac{1}{n} \sum _ {i=1}^n \mathbb{E} \left[ y_i^2 - \frac{2}{n} y_i \sum _ {j=1}^n y_j + \frac{1}{n^2} \sum _ {j=1}^n y_j \sum _ {k=1}^{n}y_k \right] = \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n} \mathbb{E}[y_i^2] - \frac{2}{n} \sum _ {j\neq i} \mathbb{E}[y_i y_j] + \frac{1}{n^2} \sum _ {j=1}^n \sum _ {k\neq j} \mathbb{E}[y_j y_k] + \frac{1}{n^2} \sum _ {j=1}^n \mathbb{E}[y_j^2] \right] = \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n}(\mu^2 + \sigma^2) - \frac{2}{n} (n-1) \mu^2 + \frac{1}{n^2} n(n-1)\mu^2 + \frac{1}{n^2} n (\mu^2 + \sigma^2)]\right] = \newline &= \frac{n-1}{n} \sigma^2 \end{aligned} \] \[\tag*{$\blacksquare$}\]
Triangle Inequality: if \(\mathbb{E} [ X ] < \infty\), then \[ |\mathbb{E} [ X ] | \leq \mathbb{E} [|X|] \]
Markov’s Inequality: if \(\mathbb{E}[ X ] < \infty\), then \[ \Pr(|X| > t) \leq \frac{1}{t} \mathbb{E}[|X|] \]
Chebyshev’s Inequality: if \(\mathbb{E}[X^2] < \infty\), then \[ \Pr(|X- \mu|> t \sigma) \leq \frac{1}{t^2}\Leftrightarrow \Pr(|X- \mu|> t ) \leq \frac{\sigma^2}{t^2} \]
Cauchy-Schwarz’s Inequality: \[ \mathbb{E} [|XY|] \leq \sqrt{\mathbb{E}[X^2] \mathbb{E}[Y^2]} \]
Minkowski Inequality: \[ \left( \sum _ {k=1}^n | x_k + y_k |^p \right) ^ {\frac{1}{p}} \leq \left( \sum _ {k=1}^n | x_k |^p \right) ^ {\frac{1}{p}} + \left( \sum _ {k=1}^n | y_k | ^p \right) ^ { \frac{1}{p} } \]
Jensen’s Inequality: if \(g( \cdot)\) is concave (e.g. logarithmic function), then \[ \mathbb{E}[g(x)] \leq g(\mathbb{E}[ X ]) \] Similarly, if \(g(\cdot)\) is convex (e.g. exponential function), then \[ \mathbb{E}[g(x)] \geq g(\mathbb{E}[ X ]) \]
Theorem (Law of Iterated Expectations) \[ \mathbb{E}(Y) = \mathbb{E}_X [\mathbb{E}(Y|X)] \] > This states that the expectation of the conditional expectation is the unconditional expectation. > > In other words the average of the conditional averages is the unconditional average.
Theorem (Law of Total Variance) \[ Var(Y) = Var_X (\mathbb{E}[Y |X]) + \mathbb{E}_X [Var(Y|X)] \]
Since variances are always non-negative, the law of total variance implies \[ Var(Y) \geq Var_X (\mathbb{E}[Y |X]) \]
We say that a random variable \(Z\) has the standard normal distribution, or Gaussian, written \(Z \sim N(0,1)\), if it has the density \[ \phi(x)=\frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right), \quad-\infty<x<\infty \] If \(Z \sim N(0, 1)\) and \(X = \mu + \sigma Z\) for \(\mu \in \mathbb R\) and \(\sigma \geq 0\), then \(X\) has a univariate normal distribution, written \(X \sim N(\mu, \sigma^2)\). By change-of-variables X has the density \[ f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right), \quad-\infty<x<\infty \]
We say that the k -vector Z has a multivariate standard normal distribution, written \(Z \sim N(0, I_k)\) if it has the joint density \[ f(x)=\frac{1}{(2 \pi)^{k / 2}} \exp \left(-\frac{x^{\prime} x}{2}\right), \quad x \in \mathbb{R}^{k} \] If \(Z \sim N(0, I_k)\) and \(X = \mu + B Z\), then the k-vector \(X\) has a multivariate normal distribution, written \(X \sim N(\mu, \Sigma)\) where \(\Sigma = BB' \geq 0\). If \(\sigma > 0\), then by change-of-variables \(X\) has the joint density function \[ f(x)=\frac{1}{(2 \pi)^{k / 2} \operatorname{det}(\Sigma)^{1 / 2}} \exp \left(-\frac{(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}{2}\right), \quad x \in \mathbb{R}^{k} \]
These distributions are relatives of the normal distribution
The \(t\) distribution is approximately standard normal but has heavier tails. The approximation is good for \(n \geq 30\): \(t_{n\geq 30} \sim N(0,1)\)