
Probability Theory

Matteo Courthoud

2021-10-29

Probability

Probability Space

A probability space is a triple (\Omega, \mathcal A, P) where

  • \Omega is the sample space.
  • \mathcal A is a σ-algebra on \Omega.
  • P is a probability measure.

The sample space \Omega is the set of all possible outcomes.

What are a σ-algebra and a probability measure?

Sigma Algebra

A nonempty set (of subsets of \Omega) \mathcal A \subseteq 2^\Omega is a sigma algebra (σ-algebra) of \Omega if the following conditions hold:

  1. \Omega \in \mathcal A
  2. If A \in \mathcal A, then (\Omega \setminus A) \in \mathcal A
  3. If A_1, A_2, \ldots \in \mathcal A, then \bigcup_{i=1}^{\infty} A_i \in \mathcal A

The smallest σ-algebra is \{\emptyset, \Omega\} and the largest one is 2^\Omega (in cardinality terms).

Suppose \Omega = \mathbb R. Let \mathcal C = \{(a, b], \ a < b < \infty\}. Then the Borel σ-algebra on \mathbb R is defined by \mathcal B(\mathbb R) = \sigma(\mathcal C)
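For a finite sample space the three conditions can be checked by brute force. Below is a minimal Python sketch (the helper names are made up for illustration and use only the standard library) that verifies them for the smallest and largest σ-algebras on a three-element \Omega.

```python
from itertools import chain, combinations

def powerset(omega):
    """All subsets of omega, as frozensets."""
    items = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

def is_sigma_algebra(A, omega):
    """Check the three axioms for a finite collection A of subsets of omega."""
    omega = frozenset(omega)
    if omega not in A:                              # 1. contains Omega
        return False
    if any(omega - a not in A for a in A):          # 2. closed under complements
        return False
    if any(a | b not in A for a in A for b in A):   # 3. closed under unions
        return False                                #    (finite case suffices here)
    return True

omega = {1, 2, 3}
print(is_sigma_algebra({frozenset(), frozenset(omega)}, omega))   # smallest: True
print(is_sigma_algebra(powerset(omega), omega))                   # largest: True
print(is_sigma_algebra({frozenset(), frozenset({1}), frozenset(omega)}, omega))  # False
```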

Probability Measure

A probability measure P : \mathcal A \to [0,1] is a set function with domain \mathcal A and codomain [0,1] such that

  1. P(A) \geq 0 \ \forall A \in \mathcal A
  2. P is σ-additive: if A_n \in \mathcal A are pairwise disjoint events (A_j \cap A_k = \emptyset for j \neq k), then P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_n)
  3. P(\Omega) = 1

Properties

Some properties of probability measures

  • P(A^c) = 1 - P(A)
  • P(\emptyset) = 0
  • For A, B \in \mathcal A, P(A \cup B) = P(A) + P(B) - P(A \cap B)
  • For A, B \in \mathcal A, if A \subseteq B then P(A) \leq P(B)
  • For A_n \in \mathcal A, P\left( \bigcup_{n=1}^{\infty} A_n \right) \leq \sum_{n=1}^{\infty} P(A_n)
  • For A_n \in \mathcal A, if A_n \uparrow A then \lim_{n \to \infty} P(A_n) = P(A)

Conditional Probability

Let A, B \in \mathcal A with P(B) > 0. The conditional probability of A given B is P(A | B)=\frac{P(A \cap B)}{P(B)}

Two events A and B are independent if P(A \cap B)=P(A) P(B).

Law of Total Probability

Theorem (Law of Total Probability)

Let (E_n) _ {n \geq 1} be a finite or countable partition of \Omega. Then, if A \in \mathcal A, P(A) = \sum_n P(A | E_n ) P(E_n)

Bayes Theorem

Theorem (Bayes Theorem)

Let (E_n) _ {n \geq 1} be a finite or countable partition of \Omega, and suppose P(A) > 0. Then, P(E_n | A) = \frac{P(A | E_n) P(E_n)}{\sum_m P(A | E_m) P(E_m)}

For a single event E \in \Omega, P(E|A) = \frac{P(A|E) P(E)}{P(A)}
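A minimal numerical sketch combining the law of total probability and Bayes' theorem for a two-event partition (the prior and likelihood numbers are made up for illustration):

```python
# Two-event partition: E1 = "condition present", E2 = "condition absent"
P_E = [0.01, 0.99]          # priors P(E_n)
P_A_given_E = [0.95, 0.05]  # likelihoods P(A | E_n), with A = "test is positive"

# Law of total probability: P(A) = sum_n P(A | E_n) P(E_n)
P_A = sum(pa * pe for pa, pe in zip(P_A_given_E, P_E))

# Bayes' theorem: P(E_1 | A) = P(A | E_1) P(E_1) / P(A)
posterior = P_A_given_E[0] * P_E[0] / P_A

print(P_A)        # 0.059
print(posterior)  # ~0.161: a positive test still leaves the condition unlikely
```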

Random Variables

Definition

A random variable X on a probability space (\Omega,\mathcal A, P) is a (measurable) mapping X : \Omega \to \mathbb{R} such that \forall B \in \mathcal{B}(\mathbb{R}), \quad X^{-1}(B) \in \mathcal{A}

The measurability condition states that the inverse image of any Borel set is a measurable subset of \Omega, i.e. X^{-1}(B) \in \mathcal A. This is essential since probabilities are defined only on \mathcal A.

In words, a random variable is a mapping from outcomes to real numbers such that every interval on the real line can be mapped back into an element of the σ-algebra (possibly the empty set).

Distribution Function

Let X be a real valued random variable. The distribution function (also called cumulative distribution function) of X, commonly denoted F_X(x), is defined by F_X(x) = \Pr(X \leq x)

Properties

  • F is monotone non-decreasing
  • F is right continuous
  • \lim _ {x \to - \infty} F(x)=0 and \lim _ {x \to + \infty} F(x)=1

The random variables (X_1, \ldots, X_n) are independent if and only if F _ {(X_1, \ldots, X_n)} (x) = \prod _ {i=1}^n F_{X_i} (x_i) \quad \forall x \in \mathbb R^n

Density Function

Let X be a real valued random variable. X has a probability density function if there exists f_X(x) such that for all measurable A \subset \mathbb{R}, P(X \in A) = \int_A f_X(x) \mathrm{d} x
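As a quick numerical check (assuming scipy is available), P(X \in A) for an interval A can be computed either from the distribution function or by integrating the density; both routes agree.

```python
from scipy import stats, integrate

X = stats.norm(loc=0, scale=1)   # a random variable with known F_X and f_X
a, b = -1.0, 2.0                 # A = (a, b]

p_cdf = X.cdf(b) - X.cdf(a)               # via F_X: P(a < X <= b) = F_X(b) - F_X(a)
p_pdf, _ = integrate.quad(X.pdf, a, b)    # via f_X: integral of the density over A

print(p_cdf, p_pdf)              # both ~0.8186
```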

Moments

Expected Value

The expected value of a random variable, when it exists, is given by \mathbb{E}[ X ] = \int_ \Omega X(\omega) \mathrm{d} P When X has a density, then \mathbb{E} [ X ] = \int_ \mathbb{R} x f_X (x) \mathrm{d} x = \int _ \mathbb{R} x \mathrm{d} F_X (x)

The empirical expectation (or sample average) is given by \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^{n} x_i
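A quick simulation (illustrative parameters, assuming numpy is available) shows the sample average settling near \mathbb{E}[X] as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # draws with E[X] = 2

for n in (100, 10_000, 100_000):
    print(n, x[:n].mean())                     # sample averages approach 2
```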

Variance and Covariance

The covariance of two random variables X, Y defined on \Omega is Cov(X, Y) = \mathbb{E}[ (X - \mathbb{E}[ X ]) (Y - \mathbb{E}[ Y ]) ] = \mathbb{E}[XY] - \mathbb{E}[ X ]\mathbb{E}[ Y ] In vector notation, Cov(X, Y) = \mathbb{E}[XY'] - \mathbb{E}[ X ]\mathbb{E}[Y'].

The variance of a random variable X, when it exists, is given by Var(X) = \mathbb{E}[ (X - \mathbb{E}[ X ])^2 ] = \mathbb{E}[X^2] - \mathbb{E}[ X ]^2 In vector notation, Var(X) = \mathbb{E}[XX'] - \mathbb{E}[ X ]\mathbb{E}[X'].

Properties

Let X, Y, Z, T \in \mathcal{L}^{2} and a, b, c, d \in \mathbb{R}

  • Cov(X, X) = Var(X)
  • Cov(X, Y) = Cov(Y, X)
  • Cov(aX + b, Y) = a \ Cov(X,Y)
  • Cov(X+Z, Y) = Cov(X,Y) + Cov(Z,Y)
  • Cov(aX + bZ, cY + dT) = ac \ Cov(X,Y) + ad \ Cov(X,T) + bc \ Cov(Z,Y) + bd \ Cov(Z,T)

Let X, Y \in \mathcal L^1 be independent. Then, \mathbb E[XY] = \mathbb E[ X ] \mathbb E[ Y ].

If X and Y are independent, then Cov(X,Y) = 0.

Note that the converse does not hold: Cov(X,Y) = 0 \not\Rightarrow X \perp Y.
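A standard counterexample, checked by simulation (assuming numpy): with X standard normal and Y = X^2, the covariance is zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2                    # completely determined by x, hence not independent of it

# Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 by symmetry of the normal distribution
print(np.cov(x, y)[0, 1])     # ~0
```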

Sample Variance

The sample variance is given by Var_n (x_i) = \frac{1}{n} \sum _ {i=1}^{n} (x_i - \bar{x})^2 where \bar{x} = \mathbb{E}_n [x_i] = \frac{1}{n} \sum _ {i=1}^{n} x_i.

Finite Sample Bias Theorem

Theorem: The expected sample variance \mathbb{E} [\sigma^2_n] = \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^{n} \left(y_i - \mathbb{E}_n[ Y ] \right)^2 \right] gives an estimate of the population variance that is biased downward by a factor of \frac{n-1}{n}, i.e. \mathbb{E}[\sigma^2_n] = \frac{n-1}{n} \sigma^2, and is therefore referred to as the biased sample variance.

Proof: \begin{aligned} \mathbb{E}[\sigma^2_n] &= \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \mathbb{E}_n [ Y ] \right)^2 \right] \\ &= \mathbb{E} \left[ \frac{1}{n} \sum _ {i=1}^n \left( y_i - \frac{1}{n} \sum _ {j=1}^n y_j \right )^2 \right] \\ &= \frac{1}{n} \sum _ {i=1}^n \mathbb{E} \left[ y_i^2 - \frac{2}{n} y_i \sum _ {j=1}^n y_j + \frac{1}{n^2} \sum _ {j=1}^n y_j \sum _ {k=1}^{n} y_k \right] \\ &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n} \mathbb{E}[y_i^2] - \frac{2}{n} \sum _ {j\neq i} \mathbb{E}[y_i y_j] + \frac{1}{n^2} \sum _ {j=1}^n \sum _ {k\neq j} \mathbb{E}[y_j y_k] + \frac{1}{n^2} \sum _ {j=1}^n \mathbb{E}[y_j^2] \right] \\ &= \frac{1}{n} \sum _ {i=1}^n \left[ \frac{n-2}{n}(\mu^2 + \sigma^2) - \frac{2}{n} (n-1) \mu^2 + \frac{1}{n^2} n(n-1)\mu^2 + \frac{1}{n^2} n (\mu^2 + \sigma^2) \right] \\ &= \frac{n-1}{n} \sigma^2 \end{aligned} \tag*{$\blacksquare$}
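A small Monte Carlo check of the theorem (numbers chosen arbitrarily, assuming numpy): averaging the biased sample variance over many samples of size n recovers \frac{n-1}{n} \sigma^2, while dividing by n-1 instead recovers \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(reps, n))
biased = samples.var(axis=1, ddof=0)      # divides by n
unbiased = samples.var(axis=1, ddof=1)    # divides by n - 1

print(biased.mean(), (n - 1) / n * sigma2)   # ~3.2 vs 3.2
print(unbiased.mean(), sigma2)               # ~4.0 vs 4.0
```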

Inequalities

  • Triangle Inequality: if \mathbb{E} [ X ] < \infty, then |\mathbb{E} [ X ] | \leq \mathbb{E} [|X|]

  • Markov’s Inequality: if \mathbb{E}[|X|] < \infty, then for any t > 0, \Pr(|X| > t) \leq \frac{1}{t} \mathbb{E}[|X|]

  • Chebyshev’s Inequality: if \mathbb{E}[X^2] < \infty, then \Pr(|X- \mu|> t \sigma) \leq \frac{1}{t^2}\Leftrightarrow \Pr(|X- \mu|> t ) \leq \frac{\sigma^2}{t^2}

  • Cauchy-Schwarz’s Inequality: \mathbb{E} [|XY|] \leq \sqrt{\mathbb{E}[X^2] \mathbb{E}[Y^2]}

  • Minkowski Inequality: for p \geq 1, \left( \sum _ {k=1}^n | x_k + y_k |^p \right) ^ {\frac{1}{p}} \leq \left( \sum _ {k=1}^n | x_k |^p \right) ^ {\frac{1}{p}} + \left( \sum _ {k=1}^n | y_k | ^p \right) ^ { \frac{1}{p} }

  • Jensen’s Inequality: if g( \cdot) is concave (e.g. the logarithm), then \mathbb{E}[g(X)] \leq g(\mathbb{E}[ X ]) Similarly, if g(\cdot) is convex (e.g. the exponential), then \mathbb{E}[g(X)] \geq g(\mathbb{E}[ X ]). A numerical check of some of these bounds is sketched below.
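A rough numerical check of Markov, Chebyshev and Jensen on an exponential distribution (parameters arbitrary, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var(X) = 1
mu, sigma = x.mean(), x.std()
t = 3.0

# Markov: P(|X| > t) <= E[|X|] / t
print((np.abs(x) > t).mean(), mu / t)

# Chebyshev: P(|X - mu| > t * sigma) <= 1 / t^2
print((np.abs(x - mu) > t * sigma).mean(), 1 / t ** 2)

# Jensen with concave g = log: E[log X] <= log E[X]
print(np.log(x).mean(), np.log(x.mean()))
```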

Law of Iterated Expectations

Theorem (Law of Iterated Expectations)

\mathbb{E}(Y) = \mathbb{E}_X [\mathbb{E}(Y|X)]

This states that the expectation of the conditional expectation is the unconditional expectation. In other words, the average of the conditional averages is the unconditional average.
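A simulation sketch (arbitrary numbers, assuming numpy): X takes three values, Y is normal within each group, and the probability-weighted average of the conditional averages matches the unconditional average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.integers(0, 3, size=n)                               # conditioning variable with 3 values
y = np.array([0.0, 2.0, 5.0])[x] + rng.standard_normal(n)    # E[Y | X = k] = 0, 2, 5

weights = np.array([(x == k).mean() for k in range(3)])      # P(X = k)
cond_avg = np.array([y[x == k].mean() for k in range(3)])    # E_n[Y | X = k]

print((weights * cond_avg).sum(), y.mean())                  # both ~ E[Y]
```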

Law of Total Variance

Theorem (Law of Total Variance)

Var(Y) = Var_X (\mathbb{E}[Y |X]) + \mathbb{E}_X [Var(Y|X)]

Since variances are always non-negative, the law of total variance implies Var(Y) \geq Var_X (\mathbb{E}[Y |X])
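The same kind of setup verifies the decomposition: the variance of the conditional means plus the average conditional variance matches Var(Y). A sketch with arbitrary numbers, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.integers(0, 3, size=n)
y = np.array([0.0, 2.0, 5.0])[x] + rng.standard_normal(n)

weights = np.array([(x == k).mean() for k in range(3)])      # P(X = k)
cond_mean = np.array([y[x == k].mean() for k in range(3)])   # E[Y | X = k]
cond_var = np.array([y[x == k].var() for k in range(3)])     # Var(Y | X = k)

between = (weights * (cond_mean - weights @ cond_mean) ** 2).sum()  # Var_X(E[Y|X])
within = (weights * cond_var).sum()                                 # E_X[Var(Y|X)]

print(between + within, y.var())     # both ~ Var(Y)
```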

Distributions

Normal Distribution

We say that a random variable Z has the standard normal (or Gaussian) distribution, written Z \sim N(0,1), if it has the density \phi(x)=\frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{x^{2}}{2}\right), \quad-\infty<x<\infty If Z \sim N(0, 1) and X = \mu + \sigma Z for \mu \in \mathbb R and \sigma \geq 0, then X has a univariate normal distribution, written X \sim N(\mu, \sigma^2). If \sigma > 0, then by change of variables X has the density f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right), \quad-\infty<x<\infty
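A quick check (assuming numpy and scipy) that the density formula above matches a reference implementation:

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2), written directly from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(normal_pdf(x, mu=1.0, sigma=2.0),
                  stats.norm.pdf(x, loc=1.0, scale=2.0)))   # True
```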

Multivariate Normal Distribution

We say that the k-vector Z has a multivariate standard normal distribution, written Z \sim N(0, I_k), if it has the joint density f(x)=\frac{1}{(2 \pi)^{k / 2}} \exp \left(-\frac{x^{\prime} x}{2}\right), \quad x \in \mathbb{R}^{k} If Z \sim N(0, I_k) and X = \mu + B Z, then the k-vector X has a multivariate normal distribution, written X \sim N(\mu, \Sigma), where \Sigma = BB' \geq 0. If \Sigma > 0, then by change of variables X has the joint density function f(x)=\frac{1}{(2 \pi)^{k / 2} \operatorname{det}(\Sigma)^{1 / 2}} \exp \left(-\frac{(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}{2}\right), \quad x \in \mathbb{R}^{k}
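The construction X = \mu + BZ gives a direct way to sample from N(\mu, \Sigma): take B to be a Cholesky factor of \Sigma. A sketch with made-up \mu and \Sigma, assuming numpy, checking the sample moments:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

B = np.linalg.cholesky(Sigma)             # Sigma = B B'
Z = rng.standard_normal((100_000, 2))     # rows are draws of Z ~ N(0, I_2)
X = mu + Z @ B.T                          # rows are draws of X ~ N(mu, Sigma)

print(X.mean(axis=0))                     # ~ mu
print(np.cov(X, rowvar=False))            # ~ Sigma
```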

Properties

  1. The expectation and covariance matrix of X \sim N(\mu, \Sigma) are \mathbb E [X] = \mu and Var[X]=\Sigma.
  2. If (X,Y) are multivariate normal, X and Y are uncorrelated if and only if they are independent.
  3. If X \sim N(\mu, \Sigma) and Y = a + BX, then Y \sim N(a + B\mu, B \Sigma B').
  4. If X \sim N(0, I_k), then X'X \sim \chi^2_k, chi-square with k degrees of freedom.
  5. If X \sim N(0, \Sigma) with \Sigma>0, then X' \Sigma^{-1} X \sim \chi^2_k where k = \dim (X).
  6. If Z \sim N(0,1) and Q \sim \chi^2_k are independent then \frac{Z}{\sqrt{Q/k}} \sim t_k, student t with k degrees of freedom.

Normal Distribution Relatives

These distributions are relatives of the normal distribution

  1. \chi^2_q \sim \sum _ {i=1}^q Z_i^2 where Z_i \sim N(0,1)
  2. t_n \sim \frac{Z}{\sqrt{\chi^2 _ n / n}}
  3. F(n_1 , n_2) \sim \frac{\chi^2 _ {n_1} / n_1}{\chi^2 _ {n_2}/n_2}

The t distribution is approximately standard normal but has heavier tails. The approximation is commonly considered good for n \geq 30: t_{n\geq 30} \approx N(0,1)
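The three constructions can be checked by simulation (sample sizes and degrees of freedom are arbitrary, assuming numpy and scipy): build χ², t and F draws from standard normals and compare them to the reference distributions with a Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200_000
q, n1, n2 = 5, 3, 8

chi2_q = (rng.standard_normal((N, q)) ** 2).sum(axis=1)            # sum of q squared N(0,1)
t_n = rng.standard_normal(N) / np.sqrt(
    (rng.standard_normal((N, n1)) ** 2).sum(axis=1) / n1)          # Z / sqrt(chi2_n1 / n1)
F = ((rng.standard_normal((N, n1)) ** 2).sum(axis=1) / n1) / \
    ((rng.standard_normal((N, n2)) ** 2).sum(axis=1) / n2)         # ratio of scaled chi2

# p-values should not be small: the simulated draws follow the reference laws
print(stats.kstest(chi2_q, stats.chi2(q).cdf).pvalue)
print(stats.kstest(t_n, stats.t(n1).cdf).pvalue)
print(stats.kstest(F, stats.f(n1, n2).cdf).pvalue)
```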