Probability Theory
Last updated on Oct 29, 2021
Probability
Probability Space
A probability space is a triple $(\Omega, \mathcal A, P)$ where
- $\Omega$ is the sample space.
- $\mathcal A$ is the $\sigma$-algebra on $\Omega$.
- $P$ is a probability measure.
The sample space $\Omega$ is the set of all possible outcomes.
What is a $\sigma$-algebra, and what is a probability measure?
Sigma Algebra
A nonempty set (of subsets of $\Omega$) $\mathcal A \subseteq 2^\Omega$ is a sigma algebra ($\sigma$-algebra) on $\Omega$ if the following conditions hold:
- $\Omega \in \mathcal A$
- If $A \in \mathcal A$, then $(\Omega - A) \in \mathcal A$
- If $A_1, A_2, … \in \mathcal A$, then $\bigcup _ {i=1}^{\infty} A_i \in \mathcal A$
The smallest $\sigma$-algebra is $\lbrace \emptyset, \Omega \rbrace$ and the largest one is $2^\Omega$ (in cardinality terms).
Suppose $\Omega = \mathbb R$. Let $\mathcal{C} = \lbrace (a, b] : -\infty \leq a<b<\infty \rbrace$. Then the Borel $\sigma$-algebra on $\mathbb R$ is defined by $$ \mathcal B (\mathbb R) = \sigma (\mathcal C) $$
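For a finite sample space, the generated $\sigma$-algebra $\sigma(\mathcal C)$ can even be computed by brute force, repeatedly closing the generating collection under complements and unions. A minimal Python sketch (the sample space and generators below are illustrative choices, not from the notes):

```python
# Brute-force the sigma-algebra generated by a collection of subsets
# of a *finite* sample space. Omega and C are illustrative choices.
from itertools import combinations

Omega = frozenset({1, 2, 3, 4})
C = [frozenset({1}), frozenset({2, 3})]          # generating sets

sigma = {frozenset(), Omega, *C}
changed = True
while changed:                                   # close under complement and union
    changed = False
    for A in list(sigma):
        comp = Omega - A
        if comp not in sigma:
            sigma.add(comp); changed = True
    for A, B in combinations(list(sigma), 2):
        if A | B not in sigma:
            sigma.add(A | B); changed = True

print(len(sigma))                        # 8: generated by the partition {1}, {2,3}, {4}
print(sorted(sorted(s) for s in sigma))  # all 8 measurable sets
```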
Probability Measure
A probability measure $P: \mathcal A \to [0,1]$ is a set function with domain $\mathcal A$ and codomain $[0,1]$ such that
- $P(A) \geq 0 \ \forall A \in \mathcal A$
- $P$ is $\sigma$-additive: if $A_1, A_2, … \in \mathcal A$ are pairwise disjoint events ($A_j \cap A_k = \emptyset$ for $j \neq k$), then $$ P\left(\bigcup _ {n=1}^{\infty} A_{n} \right)=\sum _ {n=1}^{\infty} P\left(A_{n}\right) $$
- $P(\Omega) = 1$
Properties
Some properties of probability measures
- $P\left(A^{c}\right)=1-P(A)$
- $P(\emptyset)=0$
- For $A, B \in \mathcal{A}$, $P(A \cup B)=P(A)+P(B)-P(A \cap B)$
- For $A, B \in \mathcal{A}$, if $A \subset B$ then $P(A) \leq P(B)$
- For $A_n \in \mathcal{A}$, $P \left(\cup _ {n=1}^\infty A_{n} \right) \leq \sum _ {n=1}^\infty P(A_n)$
- For $A_n \in \mathcal{A}$, if $A_n \uparrow A$ then $\lim _ {n \to \infty} P(A_n) = P(A)$
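A quick numerical sanity check of these properties, using the uniform measure on a fair six-sided die as an illustrative probability space (exact arithmetic via Python's `fractions`):

```python
# Uniform probability measure on Omega = {1,...,6}; exact rational arithmetic.
from fractions import Fraction

Omega = set(range(1, 7))
P = lambda E: Fraction(len(E & Omega), len(Omega))

A = {2, 4, 6}        # "even"
B = {4, 5, 6}        # "at least 4"

assert P(Omega - A) == 1 - P(A)                   # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)         # inclusion-exclusion
assert P({4, 6}) <= P(A)                          # monotonicity: {4,6} ⊂ A
print(P(A), P(B), P(A & B), P(A | B))             # 1/2 1/2 1/3 2/3
```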
Conditional Probability
Let $A, B \in \mathcal A$ with $P(B) > 0$. The conditional probability of $A$ given $B$ is $$ P(A | B)=\frac{P(A \cap B)}{P(B)} $$
Two events $A$ and $B$ are independent if $P(A \cap B)=P(A) P(B)$.
Law of Total Probability
Theorem (Law of Total Probability)
Let $(E_n) _ {n \geq 1}$ be a finite or countable partition of $\Omega$. Then, if $A \in \mathcal A$, $$ P(A) = \sum_n P(A | E_n ) P(E_n) $$
Bayes Theorem
Theorem (Bayes Theorem)
Let $(E_n) _ {n \geq 1}$ be a finite or countable partition of $\Omega$, and suppose $P(A) > 0$. Then, $$ P(E_n | A) = \frac{P(A | E_n) P(E_n)}{\sum_m P(A | E_m) P(E_m)} $$
For a single event $E \in \mathcal A$, $$ P(E|A) = \frac{P(A|E) P(E)}{P(A)} $$
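A small worked example combining the law of total probability and Bayes' theorem for the binary partition $\lbrace E, E^c \rbrace$ (the prior and the likelihoods below are made-up numbers for a stylized diagnostic-test illustration):

```python
# Stylized diagnostic test: E = "has condition", A = "test is positive".
p_E = 0.01            # prior P(E): prevalence (illustrative value)
p_A_given_E = 0.95    # P(A | E): true positive rate (illustrative value)
p_A_given_notE = 0.05 # P(A | E^c): false positive rate (illustrative value)

# Law of total probability: P(A) = P(A|E)P(E) + P(A|E^c)P(E^c)
p_A = p_A_given_E * p_E + p_A_given_notE * (1 - p_E)

# Bayes: P(E|A) = P(A|E)P(E) / P(A)
p_E_given_A = p_A_given_E * p_E / p_A
print(round(p_A, 4), round(p_E_given_A, 4))   # ≈ 0.059, ≈ 0.161
```

Despite the accurate test, the posterior $P(E|A)$ stays small because the prior $P(E)$ is small.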
Random Variables
Definition
A random variable $X$ on a probability space $(\Omega,\mathcal A, P)$ is a (measurable) mapping $X : \Omega \to \mathbb{R}$ such that $$ \forall B \in \mathcal{B}(\mathbb{R}), \quad X^{-1}(B) \in \mathcal{A} $$
The measurability condition states that the inverse image $X^{-1}(B)$ is a measurable subset of $\Omega$, i.e. $X^{-1}(B) \in \mathcal A$. This is essential since probabilities are defined only on $\mathcal A$.
In words, a random variable is a mapping from outcomes to real numbers such that every Borel set (e.g. every interval) on the real line can be mapped back into an element of the sigma algebra (possibly the empty set).
Distribution Function
Let $X$ be a real valued random variable. The distribution function (also called cumulative distribution function) of $X$, commonly denoted $F_X(x)$ is defined by $$ F_X(x) = \Pr(X \leq x) $$
Properties
- $F$ is monotone non-decreasing
- $F$ is right continuous
- $\lim _ {x \to - \infty} F(x)=0$ and $\lim _ {x \to + \infty} F(x)=1$
The random variables $(X_1, … , X_n)$ are independent if and only if $$ F _ {(X_1, … , X_n)} (x) = \prod _ {i=1}^n F_{X_i} (x_i) \quad \forall x = (x_1, … , x_n) \in \mathbb R^n $$
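A minimal sketch (assuming numpy is available): the empirical CDF of an i.i.d. sample approximates $F_X$, here for $X \sim N(0,1)$.

```python
# Empirical CDF of an i.i.d. standard normal sample as an estimate of F_X.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

def ecdf(sample, t):
    """Fraction of observations <= t, an estimate of F_X(t) = Pr(X <= t)."""
    return np.mean(sample <= t)

for t in (-1.0, 0.0, 1.0):
    print(t, ecdf(x, t))   # ≈ 0.159, 0.5, 0.841 for the standard normal
```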
Density Function
Let $X$ be a real valued random variable. $X$ has a probability density function if there exists $f_X(x)$ such that for all measurable $A \subset \mathbb{R}$, $$ P(X \in A) = \int_A f_X(x) \mathrm{d} x $$
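A quick numerical illustration (assuming scipy is available): for a standard normal $X$ and $A = (a, b]$, integrating the density over $A$ recovers $F_X(b) - F_X(a)$.

```python
# P(X in A) via the density integral vs. via the CDF, for X ~ N(0,1).
from scipy import stats
from scipy.integrate import quad

a, b = -1.0, 2.0
integral, _ = quad(stats.norm.pdf, a, b)          # ∫_A f_X(x) dx
via_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)   # P(a < X <= b)
print(integral, via_cdf)                          # both ≈ 0.8186
```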
Moments
Expected Value
The expected value of a random variable, when it exists, is given by $$ \mathbb{E}[ X ] = \int_ \Omega X(\omega) \mathrm{d} P $$ When $X$ has a density, then $$ \mathbb{E} [ X ] = \int_ \mathbb{R} x f_X (x) \mathrm{d} x = \int _ \mathbb{R} x \mathrm{d} F_X (x) $$
The empirical expectation (or sample average) is given by $$ \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^n x_i $$
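A small check (assuming numpy is available) that the sample average approaches $\mathbb{E}[X]$ as $n$ grows, here with $X \sim U(0,1)$ so that $\mathbb{E}[X] = 1/2$:

```python
# Sample averages of Uniform(0,1) draws for increasing sample sizes.
import numpy as np

rng = np.random.default_rng(42)
for n in (10, 1_000, 100_000):
    x = rng.uniform(0.0, 1.0, size=n)
    print(n, x.mean())      # sample averages tending to 0.5
```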
Variance and Covariance
The covariance of two random variables $X$, $Y$ defined on $\Omega$ is $$ Cov(X, Y ) = \mathbb{E}[ (X - \mathbb{E}[ X ]) (Y - \mathbb{E}[ Y ]) ] = \mathbb{E}[XY ] - \mathbb{E}[ X ]\mathbb{E}[ Y ] $$ In vector notation, $Cov(X, Y) = \mathbb{E}[XY’] - \mathbb{E}[ X ]\mathbb{E}[Y’]$.
The variance of a random variable $X$, when it exists, is given by $$ Var(X) = \mathbb{E}[ (X - \mathbb{E}[ X ])^2 ] = \mathbb{E}[X^2] - \mathbb{E}[ X ]^2 $$ In vector notation, $Var(X) = \mathbb{E}[XX’] - \mathbb{E}[ X ]\mathbb{E}[X’]$.
Properties
Let $X, Y, Z, T \in \mathcal{L}^{2}$ and $a, b, c, d \in \mathbb{R}$
- $Cov(X, X) = Var(X)$
- $Cov(X, Y) = Cov(Y, X)$
- $Cov(aX + b, Y) = a \ Cov(X,Y)$
- $Cov(X+Z, Y) = Cov(X,Y) + Cov(Z,Y)$
- $Cov(aX + bZ, cY + dT) = ac \ Cov(X,Y) + ad \ Cov(X,T) + bc \ Cov(Z,Y) + bd \ Cov(Z,T)$
Let $X, Y \in \mathcal L^1$ be independent. Then, $\mathbb E[XY] = \mathbb E[ X ] \mathbb E[ Y ]$.
If $X$ and $Y$ are independent, then $Cov(X,Y) = 0$.
Note that the converse does not hold: $Cov(X,Y) = 0 \not\Rightarrow X \perp Y$.
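A standard counterexample, sketched numerically (numpy assumed): $X \sim N(0,1)$ and $Y = X^2$ are clearly dependent, yet $Cov(X,Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0$.

```python
# Uncorrelated but dependent: X ~ N(0,1), Y = X^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2

cov_xy = np.cov(x, y)[0, 1]     # sample covariance
print(cov_xy)                   # close to zero even though Y is a function of X
```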
Sample Variance
The sample variance is given by $$ Var_n (x_i) = \frac{1}{n} \sum _ {i=1}^n (x_i - \bar{x})^2 $$ where $\bar{x} = \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^n x_i$.
Finite Sample Bias Theorem
Theorem: For an i.i.d. sample $y_1, …, y_n$ with mean $\mu$ and variance $\sigma^2$, the expected sample variance $\mathbb{E} [\sigma^2_n] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left(y_i - \mathbb{E}_n[ Y ] \right)^2 \right] = \frac{n-1}{n} \sigma^2$ underestimates the population variance by a factor of $\frac{n-1}{n}$; the estimator is therefore referred to as the biased sample variance.
Proof: $$ \begin{aligned} \mathbb{E}[\sigma^2_n] &= \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \mathbb{E}_n [ Y ] \right)^2 \right] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \frac{1}{n} \sum _ {j=1}^n y_j \right )^2 \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \mathbb{E} \left[ y_i^2 - \frac{2}{n} y_i \sum _ {j=1}^n y_j + \frac{1}{n^2} \sum _ {j=1}^n y_j \sum _ {k=1}^{n} y_k \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n} \mathbb{E}[y_i^2] - \frac{2}{n} \sum _ {j\neq i} \mathbb{E}[y_i y_j] + \frac{1}{n^2} \sum _ {j=1}^n \sum _ {k\neq j} \mathbb{E}[y_j y_k] + \frac{1}{n^2} \sum _ {j=1}^n \mathbb{E}[y_j^2] \right] \newline &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n}(\mu^2 + \sigma^2) - \frac{2}{n} (n-1) \mu^2 + \frac{1}{n^2} n(n-1)\mu^2 + \frac{1}{n^2} n (\mu^2 + \sigma^2) \right] \newline &= \frac{n-1}{n} \sigma^2 \end{aligned} $$ where the second-to-last step uses $\mathbb{E}[y_i^2] = \mu^2 + \sigma^2$ and, by independence, $\mathbb{E}[y_i y_j] = \mu^2$ for $i \neq j$. $$\tag*{$\blacksquare$}$$
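A simulation sketch of the result (numpy assumed): for small $n$ the average of the biased sample variance is close to $\frac{n-1}{n} \sigma^2$ rather than $\sigma^2$.

```python
# Monte Carlo check of the finite-sample bias of the (1/n) sample variance.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
biased_var = samples.var(axis=1, ddof=0)      # (1/n) * sum (y_i - ybar)^2

print(biased_var.mean())            # ≈ (n-1)/n * sigma2
print((n - 1) / n * sigma2)         # 3.2
```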
Inequalities
- Triangle Inequality: if $\mathbb{E} [|X|] < \infty$, then $$ |\mathbb{E} [ X ] | \leq \mathbb{E} [|X|] $$
- Markov’s Inequality: if $\mathbb{E}[|X|] < \infty$, then for every $t > 0$ $$ \Pr(|X| > t) \leq \frac{1}{t} \mathbb{E}[|X|] $$
- Chebyshev’s Inequality: if $\mathbb{E}[X^2] < \infty$, then for every $t > 0$ $$ \Pr(|X- \mu|> t \sigma) \leq \frac{1}{t^2}\Leftrightarrow \Pr(|X- \mu|> t ) \leq \frac{\sigma^2}{t^2} $$
- Cauchy-Schwarz Inequality: $$ \mathbb{E} [|XY|] \leq \sqrt{\mathbb{E}[X^2] \mathbb{E}[Y^2]} $$
- Minkowski Inequality: for $p \geq 1$, $$ \left( \sum _ {k=1}^n | x_k + y_k |^p \right) ^ {\frac{1}{p}} \leq \left( \sum _ {k=1}^n | x_k |^p \right) ^ {\frac{1}{p}} + \left( \sum _ {k=1}^n | y_k | ^p \right) ^ { \frac{1}{p} } $$
- Jensen’s Inequality: if $g( \cdot)$ is concave (e.g. the logarithm), then $$ \mathbb{E}[g(X)] \leq g(\mathbb{E}[ X ]) $$ Similarly, if $g(\cdot)$ is convex (e.g. the exponential), then $$ \mathbb{E}[g(X)] \geq g(\mathbb{E}[ X ]) $$
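A rough numerical sanity check of Markov, Chebyshev and Jensen (numpy assumed), using $X \sim Exp(1)$ so that $\mathbb{E}[X] = Var(X) = 1$ and the logarithm is concave:

```python
# Empirical left-hand sides vs. the inequality bounds, X ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=1_000_000)
t = 3.0

print(np.mean(np.abs(x) > t), np.mean(np.abs(x)) / t)     # Markov: LHS <= RHS
mu, sigma = x.mean(), x.std()
print(np.mean(np.abs(x - mu) > t), sigma**2 / t**2)       # Chebyshev: LHS <= RHS
print(np.mean(np.log(x)), np.log(x.mean()))               # Jensen: E[log X] <= log E[X]
```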
Law of Iterated Expectations
Theorem (Law of Iterated Expectations) $$ \mathbb{E}(Y) = \mathbb{E}_X [\mathbb{E}(Y|X)] $$
> This states that the expectation of the conditional expectation is the unconditional expectation.
>
> In other words, the average of the conditional averages is the unconditional average.
Law of Total Variance
Theorem (Law of Total Variance) $$ Var(Y) = Var_X (\mathbb{E}[Y |X]) + \mathbb{E}_X [Var(Y|X)] $$
Since variances are always non-negative, the law of total variance implies $$ Var(Y) \geq Var_X (\mathbb{E}[Y |X]) $$
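A simulation check of both laws (numpy assumed), with $X \sim Bernoulli(0.3)$ and $Y | X \sim N(2X, 1)$, so that $\mathbb{E}[Y|X] = 2X$ and $Var(Y|X) = 1$:

```python
# Law of iterated expectations and law of total variance by simulation.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.binomial(1, 0.3, size=n)
y = rng.normal(loc=2.0 * x, scale=1.0)

e_y_given_x = 2.0 * x                    # E[Y|X], known in closed form here
var_y_given_x = np.ones(n)               # Var(Y|X) = 1

print(y.mean(), e_y_given_x.mean())                          # LIE: both ≈ 0.6
print(y.var(), e_y_given_x.var() + var_y_given_x.mean())     # LTV: both ≈ 1.84
```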
Distributions
Normal Distribution
We say that a random variable $Z$ has the standard normal distribution, or Gaussian, written $Z \sim N(0,1)$, if it has the density $$ \phi(x)=\frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right), \quad-\infty<x<\infty $$ If $Z \sim N(0, 1)$ and $X = \mu + \sigma Z$ for $\mu \in \mathbb R$ and $\sigma \geq 0$, then $X$ has a univariate normal distribution, written $X \sim N(\mu, \sigma^2)$. If $\sigma > 0$, then by change-of-variables $X$ has the density $$ f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right), \quad-\infty<x<\infty $$
Multivariate Normal Distribution
We say that the $k$-vector $Z$ has a multivariate standard normal distribution, written $Z \sim N(0, I_k)$, if it has the joint density $$ f(x)=\frac{1}{(2 \pi)^{k / 2}} \exp \left(-\frac{x^{\prime} x}{2}\right), \quad x \in \mathbb{R}^{k} $$ If $Z \sim N(0, I_k)$ and $X = \mu + B Z$, then the $k$-vector $X$ has a multivariate normal distribution, written $X \sim N(\mu, \Sigma)$ where $\Sigma = BB’ \geq 0$. If $\Sigma > 0$, then by change-of-variables $X$ has the joint density function $$ f(x)=\frac{1}{(2 \pi)^{k / 2} \operatorname{det}(\Sigma)^{1 / 2}} \exp \left(-\frac{(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}{2}\right), \quad x \in \mathbb{R}^{k} $$
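A sketch of the construction $X = \mu + BZ$ (numpy assumed): take $B$ as the Cholesky factor of $\Sigma$, so that $BB’ = \Sigma$, and check the sample moments; the values of $\mu$ and $\Sigma$ below are illustrative.

```python
# Sample from N(mu, Sigma) by transforming standard normal draws.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

B = np.linalg.cholesky(Sigma)                # lower triangular, B @ B.T == Sigma
Z = rng.standard_normal(size=(100_000, 2))   # rows are i.i.d. N(0, I_2) draws
X = mu + Z @ B.T                             # each row ~ N(mu, Sigma)

print(X.mean(axis=0))            # ≈ mu
print(np.cov(X, rowvar=False))   # ≈ Sigma
```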
Properties
- The expectation and covariance matrix of $X \sim N(\mu, \Sigma)$ are $\mathbb E[X] = \mu$ and $Var(X) =\Sigma$.
- If $(X,Y)$ are multivariate normal, $X$ and $Y$ are uncorrelated if and only if they are independent.
- If $X \sim N(\mu, \Sigma)$ and $Y = a + BX$, then $Y \sim N(a + B\mu, B \Sigma B’)$.
- If $X \sim N(0, I_k)$, then $X’X \sim \chi^2_k$, chi-square with $k$ degrees of freedom.
- If $X \sim N(0, \Sigma)$ with $\Sigma>0$, then $X’ \Sigma^{-1} X \sim \chi^2_k$ where $k = \dim (X)$.
- If $Z \sim N(0,1)$ and $Q \sim \chi^2_k$ are independent then $\frac{Z}{\sqrt{Q/k}} \sim t_k$, student t with k degrees of freedom.
Normal Distribution Relatives
These distributions are relatives of the normal distribution
- $\chi^2_q \sim \sum _ {i=1}^q Z_i^2$ where the $Z_i \sim N(0,1)$ are independent
- $t_n \sim \frac{Z}{\sqrt{\chi^2 _ n / n}}$ where $Z \sim N(0,1)$ is independent of $\chi^2 _ n$
- $F(n_1 , n_2) \sim \frac{\chi^2 _ {n_1} / n_1}{\chi^2 _ {n_2}/n_2}$ where the two $\chi^2$ variables are independent
The $t$ distribution is approximately standard normal but has heavier tails. The approximation is usually considered good for $n \geq 30$: $t_{n\geq 30} \approx N(0,1)$.
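A simulation sketch of these constructions (numpy and scipy assumed): build $\chi^2_k$ and $t_k$ draws from independent standard normals and compare tail probabilities with `scipy.stats`.

```python
# Chi-square and Student t built from standard normals vs. scipy's distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
k, n = 3, 1_000_000

Z = rng.standard_normal(size=(n, k))
chi2_draws = (Z ** 2).sum(axis=1)                             # sum of k squared N(0,1)
t_draws = rng.standard_normal(n) / np.sqrt(chi2_draws / k)    # Z / sqrt(chi2_k / k)

print(np.mean(chi2_draws > 5.0), 1 - stats.chi2.cdf(5.0, df=k))  # both ≈ 0.172
print(np.mean(t_draws > 2.0), 1 - stats.t.cdf(2.0, df=k))        # both ≈ 0.070
```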