Probability Theory
Last updated on Oct 29, 2021
Probability
Probability Space
A probability space is a triple $(\Omega, \mathcal A, P)$ where
- $\Omega$ is the sample space.
- $\mathcal A$ is the $\sigma$-algebra on $\Omega$.
- $P$ is a probability measure.
The sample space $\Omega$ is the set of all possible outcomes.
What is a $\sigma$-algebra, and what is a probability measure?
Sigma Algebra
A nonempty set (of subsets of $\Omega$) $\mathcal A \subseteq 2^\Omega$ is a sigma algebra ($\sigma$-algebra) on $\Omega$ if the following conditions hold:
- $\Omega \in \mathcal A$
- If $A \in \mathcal A$, then $(\Omega - A) \in \mathcal A$
- If $A_1, A_2, … \in \mathcal A$, then $\bigcup _ {i=1}^{\infty} A_i \in \mathcal A$
The smallest $\sigma$-algebra is $\lbrace \emptyset, \Omega \rbrace$ and the largest one is $2^\Omega$ (in cardinality terms).
Suppose $\Omega = \mathbb R$. Let $\mathcal{C} = \lbrace (a, b] : -\infty \leq a<b<\infty \rbrace$. Then the Borel $\sigma$-algebra on $\mathbb R$ is defined by $$ \mathcal B (\mathbb R) = \sigma (\mathcal C) $$
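For a finite sample space, the generated $\sigma$-algebra $\sigma(\mathcal C)$ can even be computed by brute force, repeatedly closing the generating collection under complements and unions. A minimal Python sketch (the sample space and generators below are illustrative choices, not from the notes):

```python
# Brute-force the sigma-algebra generated by a collection of subsets
# of a *finite* sample space. Omega and C are illustrative choices.
from itertools import combinations

Omega = frozenset({1, 2, 3, 4})
C = [frozenset({1}), frozenset({2, 3})]          # generating sets

sigma = {frozenset(), Omega, *C}
changed = True
while changed:                                   # close under complement and union
    changed = False
    for A in list(sigma):
        comp = Omega - A
        if comp not in sigma:
            sigma.add(comp); changed = True
    for A, B in combinations(list(sigma), 2):
        if A | B not in sigma:
            sigma.add(A | B); changed = True

print(len(sigma))                        # 8: generated by the partition {1}, {2,3}, {4}
print(sorted(sorted(s) for s in sigma))  # all 8 measurable sets
```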
Probability Measure
A probability measure $P: \mathcal A \to [0,1]$ is a set function with domain $\mathcal A$ and codomain $[0,1]$ such that
- $P(A) \geq 0 \ \forall A \in \mathcal A$
- $P$ is $\sigma$-additive: if $A_1, A_2, … \in \mathcal A$ are pairwise disjoint events ($A_j \cap A_k = \emptyset$ for $j \neq k$), then $$ P\left(\bigcup _ {n=1}^{\infty} A_{n} \right)=\sum _ {n=1}^{\infty} P\left(A_{n}\right) $$
- $P(\Omega) = 1$
Properties
Some properties of probability measures
- $P\left(A^{c}\right)=1-P(A)$
- $P(\emptyset)=0$
- For $A, B \in \mathcal{A}$, $P(A \cup B)=P(A)+P(B)-P(A \cap B)$
- For $A, B \in \mathcal{A}$, if $A \subset B$ then $P(A) \leq P(B)$
- For $A_n \in \mathcal{A}$, $P \left(\cup _ {n=1}^\infty A_{n} \right) \leq \sum _ {n=1}^\infty P(A_n)$
- For $A_n \in \mathcal{A}$, if $A_n \uparrow A$ then $\lim _ {n \to \infty} P(A_n) = P(A)$
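A quick numerical sanity check of these properties, using the uniform measure on a fair six-sided die as an illustrative probability space (exact arithmetic via Python's `fractions`):

```python
# Uniform probability measure on Omega = {1,...,6}; exact rational arithmetic.
from fractions import Fraction

Omega = set(range(1, 7))
P = lambda E: Fraction(len(E & Omega), len(Omega))

A = {2, 4, 6}        # "even"
B = {4, 5, 6}        # "at least 4"

assert P(Omega - A) == 1 - P(A)                   # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)         # inclusion-exclusion
assert P({4, 6}) <= P(A)                          # monotonicity: {4,6} ⊂ A
print(P(A), P(B), P(A & B), P(A | B))             # 1/2 1/2 1/3 2/3
```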
Conditional Probability
Let $A, B \in \mathcal A$ with $P(B) > 0$. The conditional probability of $A$ given $B$ is $$ P(A | B)=\frac{P(A \cap B)}{P(B)} $$
Two events $A$ and $B$ are independent if $P(A \cap B)=P(A) P(B)$.
Law of Total Probability
Theorem (Law of Total Probability)
Let $(E_n) _ {n \geq 1}$ be a finite or countable partition of $\Omega$. Then, if $A \in \mathcal A$, $$ P(A) = \sum_n P(A | E_n ) P(E_n) $$
Bayes Theorem
Theorem (Bayes Theorem)
Let $(E_n) _ {n \geq 1}$ be a finite or countable partition of $\Omega$, and suppose $P(A) > 0$. Then, $$ P(E_n | A) = \frac{P(A | E_n) P(E_n)}{\sum_m P(A | E_m) P(E_m)} $$
For a single event $E \in \mathcal A$, $$ P(E|A) = \frac{P(A|E) P(E)}{P(A)} $$
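A small worked example combining the law of total probability and Bayes' theorem for the binary partition $\lbrace E, E^c \rbrace$ (the prior and the likelihoods below are made-up numbers for a stylized diagnostic-test illustration):

```python
# Stylized diagnostic test: E = "has condition", A = "test is positive".
p_E = 0.01            # prior P(E): prevalence (illustrative value)
p_A_given_E = 0.95    # P(A | E): true positive rate (illustrative value)
p_A_given_notE = 0.05 # P(A | E^c): false positive rate (illustrative value)

# Law of total probability: P(A) = P(A|E)P(E) + P(A|E^c)P(E^c)
p_A = p_A_given_E * p_E + p_A_given_notE * (1 - p_E)

# Bayes: P(E|A) = P(A|E)P(E) / P(A)
p_E_given_A = p_A_given_E * p_E / p_A
print(round(p_A, 4), round(p_E_given_A, 4))   # ≈ 0.059, ≈ 0.161
```

Despite the accurate test, the posterior $P(E|A)$ stays small because the prior $P(E)$ is small.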
Random Variables
Definition
A random variable $X$ on a probability space $(\Omega,\mathcal A, P)$ is a (measurable) mapping $X : \Omega \to \mathbb{R}$ such that $$ \forall B \in \mathcal{B}(\mathbb{R}), \quad X^{-1}(B) \in \mathcal{A} $$
The measurability condition states that the inverse image $X^{-1}(B)$ is a measurable subset of $\Omega$, i.e. $X^{-1}(B) \in \mathcal A$. This is essential since probabilities are defined only on $\mathcal A$.
In words, a random variable is a mapping from outcomes to real numbers such that every Borel set (e.g. every interval) on the real line can be mapped back into an element of the sigma algebra (possibly the empty set).
Distribution Function
Let $X$ be a real valued random variable. The distribution function (also called cumulative distribution function) of $X$, commonly denoted $F_X(x)$ is defined by $$ F_X(x) = \Pr(X \leq x) $$
Properties
- $F$ is monotone non-decreasing
- $F$ is right continuous
- $\lim _ {x \to - \infty} F(x)=0$ and $\lim _ {x \to + \infty} F(x)=1$
The random variables $(X_1, … , X_n)$ are independent if and only if $$ F _ {(X_1, … , X_n)} (x) = \prod _ {i=1}^n F_{X_i} (x_i) \quad \forall x = (x_1, … , x_n) \in \mathbb R^n $$
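A minimal sketch (assuming numpy is available): the empirical CDF of an i.i.d. sample approximates $F_X$, here for $X \sim N(0,1)$.

```python
# Empirical CDF of an i.i.d. standard normal sample as an estimate of F_X.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

def ecdf(sample, t):
    """Fraction of observations <= t, an estimate of F_X(t) = Pr(X <= t)."""
    return np.mean(sample <= t)

for t in (-1.0, 0.0, 1.0):
    print(t, ecdf(x, t))   # ≈ 0.159, 0.5, 0.841 for the standard normal
```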
Density Function
Let $X$ be a real valued random variable. $X$ has a probability density function if there exists $f_X(x)$ such that for all measurable $A \subset \mathbb{R}$, $$ P(X \in A) = \int_A f_X(x) \mathrm{d} x $$
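A quick numerical illustration (assuming scipy is available): for a standard normal $X$ and $A = (a, b]$, integrating the density over $A$ recovers $F_X(b) - F_X(a)$.

```python
# P(X in A) via the density integral vs. via the CDF, for X ~ N(0,1).
from scipy import stats
from scipy.integrate import quad

a, b = -1.0, 2.0
integral, _ = quad(stats.norm.pdf, a, b)          # ∫_A f_X(x) dx
via_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)   # P(a < X <= b)
print(integral, via_cdf)                          # both ≈ 0.8186
```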
Moments
Expected Value
The expected value of a random variable, when it exists, is given by $$ \mathbb{E}[ X ] = \int_ \Omega X(\omega) \mathrm{d} P $$ When $X$ has a density, then $$ \mathbb{E} [ X ] = \int_ \mathbb{R} x f_X (x) \mathrm{d} x = \int _ \mathbb{R} x \mathrm{d} F_X (x) $$
The empirical expectation (or sample average) is given by $$ \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^n x_i $$
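A small check (assuming numpy is available) that the sample average approaches $\mathbb{E}[X]$ as $n$ grows, here with $X \sim U(0,1)$ so that $\mathbb{E}[X] = 1/2$:

```python
# Sample averages of Uniform(0,1) draws for increasing sample sizes.
import numpy as np

rng = np.random.default_rng(42)
for n in (10, 1_000, 100_000):
    x = rng.uniform(0.0, 1.0, size=n)
    print(n, x.mean())      # sample averages tending to 0.5
```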
Variance and Covariance
The covariance of two random variables $X$, $Y$ defined on $\Omega$ is $$ Cov(X, Y ) = \mathbb{E}[ (X - \mathbb{E}[ X ]) (Y - \mathbb{E}[ Y ]) ] = \mathbb{E}[XY ] - \mathbb{E}[ X ]\mathbb{E}[ Y ] $$ In vector notation, $Cov(X, Y) = \mathbb{E}[XY’] - \mathbb{E}[ X ]\mathbb{E}[Y’]$.
The variance of a random variable $X$, when it exists, is given by $$ Var(X) = \mathbb{E}[ (X - \mathbb{E}[ X ])^2 ] = \mathbb{E}[X^2] - \mathbb{E}[ X ]^2 $$ In vector notation, $Var(X) = \mathbb{E}[XX’] - \mathbb{E}[ X ]\mathbb{E}[X’]$.
Properties
Let $X, Y, Z, T \in \mathcal{L}^{2}$ and $a, b, c, d \in \mathbb{R}$
- $Cov(X, X) = Var(X)$
- $Cov(X, Y) = Cov(Y, X)$
- $Cov(aX + b, Y) = a \ Cov(X,Y)$
- $Cov(X+Z, Y) = Cov(X,Y) + Cov(Z,Y)$
- $Cov(aX + bZ, cY + dT) = ac \ Cov(X,Y) + ad \ Cov(X,T) + bc \ Cov(Z,Y) + bd \ Cov(Z,T)$
Let $X, Y \in \mathcal L^1$ be independent. Then, $\mathbb E[XY] = \mathbb E[ X ] \mathbb E[ Y ]$.
If $X$ and $Y$ are independent, then $Cov(X,Y) = 0$.
Note that the converse does not hold: $Cov(X,Y) = 0 \not\Rightarrow X \perp Y$.
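A standard counterexample, sketched numerically (numpy assumed): $X \sim N(0,1)$ and $Y = X^2$ are clearly dependent, yet $Cov(X,Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0$.

```python
# Uncorrelated but dependent: X ~ N(0,1), Y = X^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2

cov_xy = np.cov(x, y)[0, 1]     # sample covariance
print(cov_xy)                   # close to zero even though Y is a function of X
```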
Sample Variance
The sample variance is given by $$ Var_n (x_i) = \frac{1}{n} \sum _ {i=1}^n (x_i - \bar{x})^2 $$ where $\bar{x} = \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^n x_i$.
Finite Sample Bias Theorem
Theorem: For an i.i.d. sample $y_1, …, y_n$ with mean $\mu$ and variance $\sigma^2$, the expected sample variance $\mathbb{E} [\sigma^2_n] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left(y_i - \mathbb{E}_n[ Y ] \right)^2 \right] = \frac{n-1}{n} \sigma^2$ underestimates the population variance by a factor of $\frac{n-1}{n}$; the estimator is therefore referred to as the biased sample variance.
Proof: $$ \begin{aligned} \mathbb{E}[\sigma^2_n] &= \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \mathbb{E}_n [ Y ] \right)^2 \right] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \frac{1}{n} \sum _ {j=1}^n y_j \right )^2 \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \mathbb{E} \left[ y_i^2 - \frac{2}{n} y_i \sum _ {j=1}^n y_j + \frac{1}{n^2} \sum _ {j=1}^n y_j \sum _ {k=1}^{n} y_k \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n} \mathbb{E}[y_i^2] - \frac{2}{n} \sum _ {j\neq i} \mathbb{E}[y_i y_j] + \frac{1}{n^2} \sum _ {j=1}^n \sum _ {k\neq j} \mathbb{E}[y_j y_k] + \frac{1}{n^2} \sum _ {j=1}^n \mathbb{E}[y_j^2] \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n}(\mu^2 + \sigma^2) - \frac{2}{n} (n-1) \mu^2 + \frac{1}{n^2} n(n-1)\mu^2 + \frac{1}{n^2} n (\mu^2 + \sigma^2) \right] \newline &= \frac{n-1}{n} \sigma^2 \end{aligned} $$ where the second-to-last step uses $\mathbb{E}[y_i^2] = \mu^2 + \sigma^2$ and, by independence, $\mathbb{E}[y_i y_j] = \mu^2$ for $i \neq j$. $$\tag*{$\blacksquare$}$$
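A simulation sketch of the result (numpy assumed): for small $n$ the average of the biased sample variance is close to $\frac{n-1}{n} \sigma^2$ rather than $\sigma^2$.

```python
# Monte Carlo check of the finite-sample bias of the (1/n) sample variance.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
biased_var = samples.var(axis=1, ddof=0)      # (1/n) * sum (y_i - ybar)^2

print(biased_var.mean())            # ≈ (n-1)/n * sigma2
print((n - 1) / n * sigma2)         # 3.2
```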
Inequalities
- Triangle Inequality: if $\mathbb{E} [|X|] < \infty$, then $$ |\mathbb{E} [ X ] | \leq \mathbb{E} [|X|] $$
- Markov’s Inequality: if $\mathbb{E}[|X|] < \infty$, then for every $t > 0$ $$ \Pr(|X| > t) \leq \frac{1}{t} \mathbb{E}[|X|] $$
- Chebyshev’s Inequality: if $\mathbb{E}[X^2] < \infty$, then for every $t > 0$ $$ \Pr(|X- \mu|> t \sigma) \leq \frac{1}{t^2}\Leftrightarrow \Pr(|X- \mu|> t ) \leq \frac{\sigma^2}{t^2} $$
- Cauchy-Schwarz Inequality: $$ \mathbb{E} [|XY|] \leq \sqrt{\mathbb{E}[X^2] \mathbb{E}[Y^2]} $$
- Minkowski Inequality: for $p \geq 1$, $$ \left( \sum _ {k=1}^n | x_k + y_k |^p \right) ^ {\frac{1}{p}} \leq \left( \sum _ {k=1}^n | x_k |^p \right) ^ {\frac{1}{p}} + \left( \sum _ {k=1}^n | y_k | ^p \right) ^ { \frac{1}{p} } $$
- Jensen’s Inequality: if $g( \cdot)$ is concave (e.g. the logarithm), then $$ \mathbb{E}[g(X)] \leq g(\mathbb{E}[ X ]) $$ Similarly, if $g(\cdot)$ is convex (e.g. the exponential), then $$ \mathbb{E}[g(X)] \geq g(\mathbb{E}[ X ]) $$
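A rough numerical sanity check of Markov, Chebyshev and Jensen (numpy assumed), using $X \sim Exp(1)$ so that $\mathbb{E}[X] = Var(X) = 1$ and the logarithm is concave:

```python
# Empirical left-hand sides vs. the inequality bounds, X ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=1_000_000)
t = 3.0

print(np.mean(np.abs(x) > t), np.mean(np.abs(x)) / t)     # Markov: LHS <= RHS
mu, sigma = x.mean(), x.std()
print(np.mean(np.abs(x - mu) > t), sigma**2 / t**2)       # Chebyshev: LHS <= RHS
print(np.mean(np.log(x)), np.log(x.mean()))               # Jensen: E[log X] <= log E[X]
```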
Law of Iterated Expectations
Theorem (Law of Iterated Expectations) $$ \mathbb{E}(Y) = \mathbb{E}_X [\mathbb{E}(Y|X)] $$
> This states that the expectation of the conditional expectation is the unconditional expectation.
>
> In other words, the average of the conditional averages is the unconditional average.
Law of Total Variance
Theorem (Law of Total Variance) $$ Var(Y) = Var_X (\mathbb{E}[Y |X]) + \mathbb{E}_X [Var(Y|X)] $$
Since variances are always non-negative, the law of total variance implies $$ Var(Y) \geq Var_X (\mathbb{E}[Y |X]) $$
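A simulation check of both laws (numpy assumed), with $X \sim Bernoulli(0.3)$ and $Y | X \sim N(2X, 1)$, so that $\mathbb{E}[Y|X] = 2X$ and $Var(Y|X) = 1$:

```python
# Law of iterated expectations and law of total variance by simulation.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.binomial(1, 0.3, size=n)
y = rng.normal(loc=2.0 * x, scale=1.0)

e_y_given_x = 2.0 * x                    # E[Y|X], known in closed form here
var_y_given_x = np.ones(n)               # Var(Y|X) = 1

print(y.mean(), e_y_given_x.mean())                          # LIE: both ≈ 0.6
print(y.var(), e_y_given_x.var() + var_y_given_x.mean())     # LTV: both ≈ 1.84
```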
Distributions
Normal Distribution
We say that a random variable $Z$ has the standard normal distribution, or Gaussian, written $Z \sim N(0,1)$, if it has the density $$ \phi(x)=\frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right), \quad-\infty<x<\infty $$ If $Z \sim N(0, 1)$ and $X = \mu + \sigma Z$ for $\mu \in \mathbb R$ and $\sigma \geq 0$, then $X$ has a univariate normal distribution, written $X \sim N(\mu, \sigma^2)$. If $\sigma > 0$, then by change-of-variables $X$ has the density $$ f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right), \quad-\infty<x<\infty $$
Multivariate Normal Distribution
We say that the $k$-vector $Z$ has a multivariate standard normal distribution, written $Z \sim N(0, I_k)$, if it has the joint density $$ f(x)=\frac{1}{(2 \pi)^{k / 2}} \exp \left(-\frac{x^{\prime} x}{2}\right), \quad x \in \mathbb{R}^{k} $$ If $Z \sim N(0, I_k)$ and $X = \mu + B Z$, then the $k$-vector $X$ has a multivariate normal distribution, written $X \sim N(\mu, \Sigma)$ where $\Sigma = BB’ \geq 0$. If $\Sigma > 0$, then by change-of-variables $X$ has the joint density function $$ f(x)=\frac{1}{(2 \pi)^{k / 2} \operatorname{det}(\Sigma)^{1 / 2}} \exp \left(-\frac{(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}{2}\right), \quad x \in \mathbb{R}^{k} $$
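A sketch of the construction $X = \mu + BZ$ (numpy assumed): take $B$ as the Cholesky factor of $\Sigma$, so that $BB’ = \Sigma$, and check the sample moments; the values of $\mu$ and $\Sigma$ below are illustrative.

```python
# Sample from N(mu, Sigma) by transforming standard normal draws.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

B = np.linalg.cholesky(Sigma)                # lower triangular, B @ B.T == Sigma
Z = rng.standard_normal(size=(100_000, 2))   # rows are i.i.d. N(0, I_2) draws
X = mu + Z @ B.T                             # each row ~ N(mu, Sigma)

print(X.mean(axis=0))            # ≈ mu
print(np.cov(X, rowvar=False))   # ≈ Sigma
```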
Properties
- The expectation and covariance matrix of $X \sim N(\mu, \Sigma)$ are $\mathbb E[X] = \mu$ and $Var(X) =\Sigma$.
- If $(X,Y)$ are multivariate normal, $X$ and $Y$ are uncorrelated if and only if they are independent.
- If $X \sim N(\mu, \Sigma)$ and $Y = a + BX$, then $Y \sim N(a + B\mu, B \Sigma B’)$.
- If $X \sim N(0, I_k)$, then $X’X \sim \chi^2_k$, chi-square with $k$ degrees of freedom.
- If $X \sim N(0, \Sigma)$ with $\Sigma>0$, then $X’ \Sigma^{-1} X \sim \chi^2_k$ where $k = \dim (X)$.
- If $Z \sim N(0,1)$ and $Q \sim \chi^2_k$ are independent then $\frac{Z}{\sqrt{Q/k}} \sim t_k$, student t with k degrees of freedom.
Normal Distribution Relatives
These distributions are relatives of the normal distribution
- $\chi^2_q \sim \sum _ {i=1}^q Z_i^2$ where the $Z_i \sim N(0,1)$ are independent
- $t_n \sim \frac{Z}{\sqrt{\chi^2 _ n / n}}$ where $Z \sim N(0,1)$ is independent of $\chi^2 _ n$
- $F(n_1 , n_2) \sim \frac{\chi^2 _ {n_1} / n_1}{\chi^2 _ {n_2}/n_2}$ where the two $\chi^2$ variables are independent
The $t$ distribution is approximately standard normal but has heavier tails. The approximation is usually considered good for $n \geq 30$: $t_{n\geq 30} \approx N(0,1)$.
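A simulation sketch of these constructions (numpy and scipy assumed): build $\chi^2_k$ and $t_k$ draws from independent standard normals and compare tail probabilities with `scipy.stats`.

```python
# Chi-square and Student t built from standard normals vs. scipy's distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
k, n = 3, 1_000_000

Z = rng.standard_normal(size=(n, k))
chi2_draws = (Z ** 2).sum(axis=1)                             # sum of k squared N(0,1)
t_draws = rng.standard_normal(n) / np.sqrt(chi2_draws / k)    # Z / sqrt(chi2_k / k)

print(np.mean(chi2_draws > 5.0), 1 - stats.chi2.cdf(5.0, df=k))  # both ≈ 0.172
print(np.mean(t_draws > 2.0), 1 - stats.t.cdf(2.0, df=k))        # both ≈ 0.070
```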