Probability

Keith A. Lewis

April 25, 2024

Abstract
A calculus for uncertainty

Probability is an extension of the rules of logic to deal with uncertain events. A probability is a number between 0 and 1 representing a degree of belief. All probabilities are conditional on available information and can be systematically updated using Bayes’ theorem as more information becomes available.

Probability Space

A sample space Ω is the set of what can happen. An outcome ω\in Ω is an element of a sample space. An event E\subseteq Ω is a subset of a sample space. A probability measure, P, is a function from events to [0,1] satisfying P(E\cup F) = P(E) + P(F) - P(E\cap F) and P(\emptyset) = 0. A measure does not count things twice and the measure of nothing is 0.

Exercise. Show P(E\cup F) = P(E) + P(F) if E\cap F = \emptyset.

It was a major event in our world when Kolmogorov legitimized previous results in probability theory that rascals and scoundrels figured out by hook or by crook to win at gambling. Measure theory was developed by Lebesgue to generalize Riemann–Stieltjes integration. Mathematicians found probability theory more legitimate when Kolmogorov pointed out that a probability measure is simply a positive measure having mass 1.

An algebra \mathcal{A} is a collection of events that is closed under set complement and union.

The complement of the event E\subseteq Ω, \neg E, is the set of outcomes ‘not’ in E. The union of events, E\cup F, is the set of outcomes that belong to E ‘or’ F.

Exercise. Show if \mathcal{A} is closed under complement and union then so is \mathcal{A}\cup\{\emptyset, Ω\}.

It is convenient to assume the empty set belongs to the algebra.

Exercise. Show Ω\in\mathcal{A} and \mathcal{A} is closed under set intersection.

Hint: Use \emptyset\in\mathcal{A} and De Morgan’s laws.

Solution The set complement of A\subseteq Ω is \neg A = \{ω\in Ω\mid ω\not\in A\} so \neg \emptyset = Ω\in\mathcal{A}. Since \neg(A\cap B) = \neg A \cup\neg B we have A\cap B\in\mathcal{A}.

If an algebra is also closed under countable unions of events then it is a σ-algebra. This means if E_n\in\mathcal{A}, n\in\mathbf{N}, then \cup_{n\in\mathbf{N}} E_n\in\mathcal{A}, where \mathbf{N}= \{0, 1, 2, \ldots\} is the set of natural numbers.

Exercise. If an algebra is closed under countable unions then it is also closed under countable intersections.

Algebras model partial information.

Partition

If \mathcal{A} is finite then its atoms are a partition of the sample space and \mathcal{A} is generated by its atoms.

An atom is a non-empty event A\in\mathcal{A} such that B\in\mathcal{A} and B\subseteq A implies B=\emptyset or B=A. If \mathcal{A} is finite and ω\in Ω define {A_ω = \cap\{A\in\mathcal{A}\mid ω\in A\}}.

Exercise. Show A_ω is an atom of \mathcal{A} containing ω\in Ω.

Exercise. Show \{A_ω\mid ω\in Ω\} is a partition of Ω.

Hint: Since ω\in A_ω, ω\in Ω, the union is Ω. Show either {A_ω \cap A_{ω'} = \emptyset} or A_ω = A_{ω'}, ω,ω'\in Ω.

If \mathcal{A} is finite then we can identify it with its atoms.

Partial information is modeled by a partition of a sample space. Complete information corresponds to the finest partition of singletons \{\{ω\}\mid ω\in Ω\}. No information corresponds to the coarsest partition \{Ω\}. Partial information is knowing only which set in the partition an outcome belongs to.

Coin tossing

We can model a sequence of random coin flips by a sequence of 0’s and 1’s where 0 corresponds to heads and 1 corresponds to tails. (Or vice versa.) If ω_j\in\{0,1\} is the j-th flip define {ω = \sum_{j=1}^\infty ω_j/2^j}.

Exercise. Show ω\in [0,1).

Exercise. Show if ω_1 = 0 then ω\in [0,1/2) and if ω_1 = 1 then ω\in [1/2,1).

This shows the partition \mathcal{A}_1 = \{[0,1/2), [1/2,1)\} of [0, 1) represents knowing the first base 2 digit of ω. Similarly, the partition \mathcal{A}_2 = \{[0,1/4), [1/4, 1/2), [1/2, 3/4), [3/4,1)\} represents knowing the first two base 2 digits of ω.

Exercise. Show the partition \mathcal{A}_n = \{[j/2^n, (j + 1)/2^n)\mid 0\le j < 2^n\} of [0, 1) represents knowing the first n base 2 digits of ω\in [0,1).
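
For concreteness, here is a short Python sketch of this correspondence, assuming a made-up finite number of simulated flips: the first n flips determine which atom [j/2^n, (j+1)/2^n) of \mathcal{A}_n contains ω.

```python
# A minimal sketch: simulate coin flips, map them to omega in [0, 1), and
# check that the first n flips determine the atom of A_n containing omega.
import random

def flips_to_omega(flips):
    """omega = sum_j flips[j-1] / 2^j for a finite list of 0/1 flips."""
    return sum(f / 2**(j + 1) for j, f in enumerate(flips))

random.seed(0)                              # arbitrary seed for reproducibility
flips = [random.randint(0, 1) for _ in range(20)]
omega = flips_to_omega(flips)

n = 3
j = int(omega * 2**n)                       # index of the atom of A_n containing omega
assert j == int("".join(map(str, flips[:n])), 2)   # first n base-2 digits of omega
print(f"omega = {omega:.6f} lies in [{j}/2^{n}, {j + 1}/2^{n})")
```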

Measure

A measure is a set function μ\colon\mathcal{A}\to\mathbf{R} that satisfies μ(E\cup F) = μ(E) + μ(F) - μ(E\cap F) for E,F\in\mathcal{A}. Measures do not count things twice. We also assume μ(\emptyset) = 0. The measure of nothing is 0.

Exercise. Show μ(E\cup F) = μ(E) + μ(F) if E\cap F=\emptyset.

A probability measure P is a measure with P(E)\ge0 for all E\in\mathcal{A} and P(Ω) = 1.

Discrete

If Ω = \{ω_j\} is finite (or countable) we can define a discrete probability measure by P(\{ω_j\}) = p_j where p_j > 0 and \sum_j p_j = 1.

Exercise. Show P(E) = \sum_{ω_j\in E} p_j.

Hint. E is the disjoint union of singletons \{ω_j\} where ω_j\in E.
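
As a small illustration, the following Python sketch uses a made-up four-point sample space and weights p_j to compute P(E) and check that the measure does not count things twice.

```python
# Hypothetical discrete probability measure on Omega = {0, 1, 2, 3}.
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}        # p_j > 0 and sum to 1

def prob(event):
    """P(E) = sum of p_j over omega_j in E."""
    return sum(p[w] for w in event)

E, F = {0, 1}, {1, 2}
# P(E u F) = P(E) + P(F) - P(E n F): outcome 1 is not counted twice.
assert abs(prob(E | F) - (prob(E) + prob(F) - prob(E & F))) < 1e-12
print(prob(E | F))                          # 0.6
```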

Uniform

The uniform measure on Ω = [0,1) is defined by λ([a,b)) = b - a for 0\le a\le b < 1.

Conditional Probability

The conditional probability of an event B given an event A is {P(B\mid A) = P(B\cap A)/P(A)}.

Exercise. Show P_A(B) = P(B\cap A)/P(A) is a probability measure on A.

Hint: Show P_A(A) = 1 and P_A(B\cup C) = P_A(B) + P_A(C) - P_A(B\cap C).

Exercise. Show P(B\mid A) = P(B) P(A\mid B)/P(A).

This is the simplest form of Bayes’ theorem. It shows how to update your degree of belief based on new information. Every probability is conditional on information.

We say B is independent of A if P(B\mid A) = P(B).

Exercise. Show B is independent of A if and only if P(A\cap B) = P(A)P(B).

Exercise. Show B is independent of A if and only if A is independent of B.

We also say A and B are independent.

Example

Suppose a family moves in next door and you are told they have two children. If you step on a GI Joe doll in their yard on your way to work what is the probability they are both boys?

The first step is to establish the sample space and the probability measure. We assume \Omega = \{FF, FM, MF, MM\} represents the female or male gender of the younger and older child and that each possibility is equally likely.

The event “step on a GI Joe” doll corresponds to B = \{FM, MF, MM\} indicating at least one of the children is a boy. Bayes’ theorem implies P(\{MM\}\mid B) = P(\{MM\})/P(B) = (1/4)/(3/4) = 1/3, not 1/2.

As in every model, there are assumptions. It may not be the case female and male children are equally likely. If p is the probability of a child being male then {P(\{MM\}\mid B) = p^2/(p(1-p) + (1-p)p + p^2) = p/(2 - p)}. If p = 1/2 then p/(2- p) = 1/3.
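
The formula p/(2 - p) can be checked by simulation. The following Python sketch, with an arbitrary number of trials, estimates P(\{MM\}\mid B) by sampling the two children independently with probability p of being male.

```python
# Monte Carlo estimate of P(both male | at least one male) versus p/(2 - p).
import random

def estimate(p, trials=200_000, seed=1):    # trial count and seed are arbitrary
    random.seed(seed)
    both = at_least_one = 0
    for _ in range(trials):
        kids = [random.random() < p, random.random() < p]   # True means male
        if any(kids):
            at_least_one += 1
            if all(kids):
                both += 1
    return both / at_least_one

p = 0.5
print(estimate(p), p / (2 - p))             # both are close to 1/3
```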

Exercise. What if p = 0 or p = 1?

This assumes the probability of each child being male or female is independent of the order of having children. This does not hold, e.g., in countries where parents kill their first child if it is female.

The assumption of stumbling across a GI Joe, or a Barbie, doll implying one of the children is or is not male may also not be valid.

Probability theory can still be applied; it is just a matter of extending the sample space and finding an appropriate probability measure.

Random Variable

Random variables are symbols that can be used in place of a number when manipulating equations and inequalities. The cumulative distribution function F of the random variable X is F(x) = P(X \le x), the probability X is not greater than x. The cdf tells you everything there is to know about the probability of the values a random variable can take on.

Exercise. If P is discrete and X is the identity function on Ω = \{ω_j\} then P(X \le x) = \sum_{ω_j\le x} p_j.

Note F is piece-wise constant, non-decreasing, and right-continuous.

Exercise. If X is uniformly distributed on [0,1) then F(x) = \max\{0,\min\{1, x\}\} for -\infty < x < \infty.

Hint: If x < 0 then F(x) = 0 and if x \ge 1 then F(x) = 1.

The mathematical definition of a random variable is an \mathcal{A}-measurable function X\colon Ω\to\mathbf{R} on a probability space \langle Ω, P, \mathcal{A}\rangle. The function is \mathcal{A}-measurable if {\{ω\in Ω\mid X(ω) \le x\}\in\mathcal{A}}, x\in\mathbf{R}, and we write X\colon\mathcal{A}\to\mathbf{R}.

Exercise. If \mathcal{A} is finite then X is constant on its atoms.

Note that X is a function on atoms in this case.

The casual definition of the cumulative distribution function of a random variable as F(x) = P(X\le x) being the probability of X being less than or equal to x leaves out the important problem of specifying exactly what “probability” means.

The rigorous mathematical definition is {F_X(x) = P(\{ω\in Ω\mid X(ω) \le x\})} where P is a probability measure. We write F instead of F_X if X is understood. More generally, given a subset A\subseteq\mathbf{R} the probability that X takes a value in A is {P(X\in A) = P(\{ω\in Ω\mid X(ω)\in A\})}. The cdf corresponds to A = (-\infty, x]. Two random variables have the same law if they have the same cdf.

Exercise. Show U and 1 - U have the same law, where U is uniformly distributed on [0,1).

Note U \not= 1-U.

Exercise. If X has a continuous cdf F, then F(X) and U have the same law.

Hint: If F(x) jumps from a to b at x = c we define F^{-1}(u) = c for a \le u < b.

Solution P(F(X) \le x) = P(X\le F^{-1}(x)) = F(F^{-1}(x)) = x for 0\le x\le 1.

Exercise. If X has cdf F, then X and F^{-1}(U) have the same law.

Solution We have P(F^{-1}(U) \le x) = P(U\le F(x)) = F(x) since 0\le F(x)\le 1.

This shows a uniformly distributed random variable has sufficient randomness to generate any random variable. There are no random, random variables.
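
Here is a Python sketch of this fact, assuming for illustration the standard exponential cdf F(x) = 1 - e^{-x}, so F^{-1}(u) = -\log(1 - u): the empirical cdf of F^{-1}(U) is close to F.

```python
# Inverse transform sampling: F^{-1}(U) has cdf F when U is uniform on [0, 1).
import math
import random

def F(x):                                    # standard exponential cdf (an assumption)
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def F_inv(u):
    return -math.log(1.0 - u)

random.seed(2)                               # arbitrary seed
xs = [F_inv(random.random()) for _ in range(100_000)]

for x in (0.5, 1.0, 2.0):
    empirical = sum(1 for v in xs if v <= x) / len(xs)
    print(f"x = {x}: empirical {empirical:.3f} vs F(x) {F(x):.3f}")
```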

Exercise. Show P(a < X \le b) = F(b) - F(a).

Hint: (-\infty, b] = (-\infty, a] \cup (a, b].

Every cdf is non-decreasing, continuous from the right, has left limits, and \lim_{x\to-\infty}F(x) = 0, \lim_{x\to+\infty}F(x) = 1. Any function with these properties is the cdf of a random variable.

Exercise: Show F(x) \le F(y) if x < y.

Hint: (-\infty, x] \subset (-\infty, y] if x < y and P(x < X \le y) \ge 0.

Exercise: Show \lim_{y\downarrow x} F(y) = F(x).

Hint: \cap_{n=1}^\infty (-\infty, x + 1/n] = (-\infty, x].

Exercise: Show \lim_{y\uparrow x} F(y) = F(x-) exists.

Hint: If y_n is an increasing sequence with limit x then F(y_n) is a non-decreasing sequence bounded by F(x) so \sup_n F(y_n) exists and is not greater than F(x).

Exercise. Show \lim_{x\to-\infty}F(x) = 0.

Hint: \cap_n (-\infty, -n] = \emptyset.

Exercise. Show \lim_{x\to\infty}F(x) = 1.

Hint: \cup_n (-\infty, n] = (-\infty, \infty).

In general P(X\in A) = \int_A dF(x) for sufficiently nice subsets A\subset\mathbf{R} using Riemann–Stieltjes integration.

Define the conditional expectation of the random variable X with respect to the event A by E[X\mid A] = E[X 1_A]/P(A). If X = 1_B then E[1_B\mid A] = P(B\mid A), so this coincides with the definition of conditional probability above.

Define the conditional expectation of X with respect to the algebra \mathcal{A}, E[X\mid \mathcal{A}]:\mathcal{A}\to\mathbf{R}, by E[X\mid \mathcal{A}](A) = E[X\mid A] for A an atom of \mathcal{A}.
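
A minimal Python sketch of this definition, using a made-up finite sample space, weights, and partition: E[X\mid\mathcal{A}] is computed atom by atom as E[X 1_A]/P(A).

```python
# Conditional expectation with respect to a finite algebra, identified with
# the partition of Omega into its atoms. All numbers below are hypothetical.
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}        # P({omega})
X = {0: 5.0, 1: 7.0, 2: 1.0, 3: 3.0}        # X(omega)
atoms = [{0, 1}, {2, 3}]                    # partition generating the algebra

def cond_expectation(X, P, atoms):
    """Return E[X | A] as a function on atoms: E[X 1_A] / P(A) for each atom A."""
    out = {}
    for A in atoms:
        pA = sum(P[w] for w in A)
        out[frozenset(A)] = sum(X[w] * P[w] for w in A) / pA
    return out

print(cond_expectation(X, P, atoms))
# {frozenset({0, 1}): 6.333..., frozenset({2, 3}): 2.142857...}
```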

Joint Distribution

Two random variables, X and Y, are determined (in law) by their joint distribution, H(x,y) = P(X\le x, Y\le y).

Exercise. Show the point (X,Y) is in the square (a,b]\times (c,d] with probability {P(a < X \le b, c < Y \le d) = P(X \le b, Y \le d) - P(X \le a, Y \le d) - P(X \le b, Y \le c) + P(X \le a, Y \le c)}.

This allows computing the probability the point belongs to any set that is a countable union of squares, a measurable set.

Exercise. All convex sets are measurable.

In general, the joint distribution of X_1, \ldots, X_n is defined by {F(x_1,\ldots,x_n) = P(X_1\le x_1, \ldots, X_n\le x_n)}. If A is a measurable subset of \mathbf{R}^n we can use this to compute P((X_1,\ldots,X_n)\in A).

Independent

The random variables X and Y are independent if H(x,y) = F(x)G(y) for all x and y. This is equivalent to P(X\in A,Y\in B) = P(X\in A)P(Y\in B) for measurable sets A and B.

We also have that E[f(X)g(Y)] = E[f(X)] E[g(Y)] for any functions f and g whenever all expected values exist.

Exercise: Prove this for the case f = \sum_i a_i 1_{A_i} and g = \sum_j b_j 1_{B_j}.

In general, X_1, \ldots, X_n are independent if F(x_1,\ldots,x_n) = F_1(x_1)\cdots F_n(x_n), where F_j is the cdf of X_j.

Copula

A copula is the joint distribution of uniformly distributed random variables on the unit interval. The copula of X and Y is the joint distribution of F(X) and G(Y) where F and G are the cumulative distribution functions of X and Y respectively: C(u,v) = C^{X,Y}(u,v) = P(F(X) \le u, G(Y) \le v).

Exercise: Show C(u,v) = H(F^{-1}(u),G^{-1}(v)) where H is the joint distribution of X and Y and F and G are the cumulative distribution functions of X and Y.

Exercise: Show H(x,y) = C(F(x), G(y)).

This shows how to use the copula and marginal distributions to recover the joint distribution.

An equivalent definition is that a copula is a probability measure on [0,1]^2 with uniform marginals.

Exercise: Prove this.

If U and V are independent, uniformly distributed random variables on the unit interval then C(u,v) = uv.

If V=U then their joint distribution is C(u,v) = P(U\le u, V\le v) = P(U\le u, U\le v) = P(U\le \min\{u, v\}) = \min\{u,v\} = M(u,v).

If V=1-U then their joint distribution is C(u,v) = P(U\le u, V\le v) = P(U\le u, 1-U\le v) = P(1-v\le U\le u) = \max\{u - (1 - v), 0\} = \max\{u + v - 1, 0\} = W(u,v).

Exercise: (Fréchet-Hoeffding) For every copula, C, W \le C \le M.

Hint: For the upper bound use H(x,y) \le F(x) and H(x,y) \le G(y). For the lower bound note 0\le C(u_1,v_1) - C(u_1, v_2) - C(u_2, v_1) + C(u_2, v_2) for u_1 \ge u_2 and v_1 \ge v_2.
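
A quick numerical sketch of these bounds for the independence copula C(u,v) = uv, checked on an arbitrary grid of points:

```python
# Frechet-Hoeffding bounds W <= C <= M for the independence copula C(u, v) = u v.
def W(u, v): return max(u + v - 1.0, 0.0)
def M(u, v): return min(u, v)
def C_indep(u, v): return u * v

grid = [i / 20 for i in range(21)]          # grid resolution is arbitrary
assert all(W(u, v) <= C_indep(u, v) <= M(u, v) for u in grid for v in grid)
print("W <= uv <= M holds on the grid")
```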

Uniform

A uniformly distributed random variable U on [0,1) has cdf F(x) = x if 0\le x\le 1, F(x) = 0 if x < 0, and F(x) = 1 if x > 1. Given a cdf F we can define a random variable having that law using the identity function X\colon\mathbf{R}\to\mathbf{R}, where X(x) = x. Let P be the probability measure on \mathbf{R} defined by P(A) = \int_A dF(x).

Continuous Random Variable

If the cdf satisfies F(x) = \int_{-\infty}^x F'(u)\,du we say the random variable is continuously distributed. The density function is f = F'. Any function satisfying f\ge 0 and \int_\mathbf{R} f(x)\,dx = 1 is a density function for a random variable.

Exercise. If X is continuously distributed show P(a < X \le b) = P(a \le X \le b) = P(a \le X < b) = P(a < X < b).

Expected Value

The expected value or mean of a random variable is defined by μ = E[X] = \int_Ω X\,dP. It is a measure of the location of X. More generally, the expected value of a function f\colon\mathbf{R}\to\mathbf{R} of a random variable is E[f(X)] = \int_Ω f(X)\,dP where f(X) is a random variable defined by f(X)(ω) = f(X(ω)), ω\in Ω. The functions f(x) = x^n are used to define moments.

Moment

The moments of a random variable are \mu_n = E[X^n] where n is a non-negative integer. The central moments are \bar{μ}_n = E[(X - E[X])^n]. The second central moment is the variance σ^2 = \operatorname{Var}(X) = E[(X - E[X])^2]. The standard deviation σ is a measure of the spread of X.

Exercise. Show \operatorname{Var}(X) = E[X^2] - E[X]^2.

Every random variable with non-zero variance can be standardized to have mean 0 and variance 1. The skew of a random variable is the third central moment of a standardized random variable. It is a measure of the lopsidedness of the distribution.

Exercise. If X and -X have the same law then its skew is 0.

The kurtosis of a random variable is the fourth central moment of a standardized random variable. It is a measure of how peaked a distribution is.

The moment generating function is \mu(s) = E[e^{sX}] = \sum_{n=0}^\infty \mu_n s^n/n!. Note \mu^{(n)}(0) = \mu_n by Taylor’s theorem if the series converges in a neighborhood of s = 0.

Moments don’t necessarily exist for all n, except for n = 0. They also cannot be an arbitrary sequence of values.

Suppose all moments of X exist. Then for any complex numbers (c_i), {0 \le E|\sum_i c_i X^i|^2 = E[\sum_{j,k} c_j\bar{c}_k X^{j+k}] = \sum_{j,k} c_j \bar{c}_k \mu_{j+k}}. This says the Hankel matrix M = [\mu_{j+k}]_{j,k} is positive semidefinite. The converse is also true: if the Hankel matrix is positive semidefinite there exists a random variable with the corresponding moments. This is not a trivial result and the random variable might not be unique.
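
As a sketch, the following Python code builds the Hankel matrix of the moments of a standard normal random variable (\mu_n = 0 for odd n and \mu_n = (n-1)!! for even n) and checks that its eigenvalues are nonnegative.

```python
# Hankel matrix [mu_{j+k}] of standard normal moments is positive semidefinite.
import numpy as np

def normal_moment(n):
    """E[Z^n] for standard normal Z: 0 for odd n, (n - 1)!! for even n."""
    if n % 2 == 1:
        return 0.0
    return float(np.prod(np.arange(n - 1, 0, -2))) if n > 0 else 1.0

N = 5                                        # matrix size is arbitrary
H = np.array([[normal_moment(j + k) for k in range(N)] for j in range(N)])
print(np.linalg.eigvalsh(H))                 # all eigenvalues are nonnegative
```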

Cumulant

The cumulant of X is the natural logarithm of its moment generating function, {\kappa(s) = \log E[e^{sX}]}.

Exercise. Show κ_X(0) = 0, κ_X'(0) = E[X], and κ_X''(0) = \operatorname{Var}(X).

Exercise. Show if c is a constant then κ_{c + X}(s) = cs + κ_X(s).

Exercise. Show if c is a constant then κ_{cX}(s) = κ_X(cs).

Exercise. Show if X and Y are independent then κ_{X + Y}(s) = κ_X(s) + κ_Y(s).

Hint. X and Y are independent if and only if E[f(X)g(Y)] = E[f(X)]E[g(Y)] for any functions f and g.

The cumulants (κ_n) are the coefficients of the power series expansion κ(s) = \sum_{n>0}κ_n s^n/n!.

Exercise. Show κ_1 = E[X] and κ_2 = \mathrm{Var}(X).

The third and fourth cumulants are related to skew and kurtosis. If the variance is 1, then κ_3 is the skew and κ_4 is the excess kurtosis.

Exercise. Show κ_1(c + X) = c + κ_1(X) and κ_n(c + X) = κ_n(X), n \ge 2.

Exercise. Show κ_n(cX) = c^n κ_n(X), n\ge 1.

Exercise. Show if X and Y are independent κ_n(X + Y) = κ_n(X) + κ_n(Y) for all n.

The moments of X, \mu_n = E[X^n], are related to the cumulants via the complete Bell polynomials B_n(κ_1,\ldots,κ_n): {E[e^{sX}] = \sum_{n\ge0} \mu_n s^n/n! = e^{κ(s)} = e^{\sum_{n>0} κ_n s^n/n!} = \sum_{n\ge0} B_n(κ_1,\ldots,κ_n) s^n/n!}, so \mu_n = B_n(κ_1,\ldots,κ_n). Taking a derivative with respect to s of the last equality gives the recurrence formula {B_{n+1}(κ_1,\ldots,κ_{n+1}) = \sum_{k = 0}^n \binom{n}{k} B_{n - k}(κ_1,\ldots,κ_{n-k})κ_{k + 1}}, n \ge 0, with B_0 = 1.

The cumulants are related to the moments via the partial Bell polynomials: {κ_n = \sum_{k=0}^{n-1} (-1)^k k! B_{n,k+1}(\mu_1,\ldots,\mu_{n - k})}, where the partial Bell polynomials (B_{n,k}) satisfy the recurrence B_{0,0} = 1, B_{n,0} = 0 for n > 0, B_{0,k} = 0 for k > 0, and {B_{n,k}(x_1,\ldots,x_{n - k + 1}) = \sum_{i=1}^{n-k+1}\binom{n-1}{i - 1} B_{n-i,k-1}(x_1,\ldots,x_{n - i - k + 2})x_i}.
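
The recurrence for the complete Bell polynomials translates directly into code. This Python sketch computes the moments \mu_n = B_n(κ_1,\ldots,κ_n) from a given list of cumulants; for a standard normal (κ_2 = 1, all other cumulants 0) it reproduces the moments 1, 0, 1, 0, 3, 0, 15.

```python
# Moments from cumulants via B_{n+1} = sum_k C(n, k) B_{n-k} kappa_{k+1}, B_0 = 1.
from math import comb

def moments_from_cumulants(kappa):
    """kappa[k] = kappa_{k+1}; returns [mu_0, mu_1, ..., mu_N] with N = len(kappa)."""
    N = len(kappa)
    B = [1.0]                                # B_0 = 1
    for n in range(N):
        B.append(sum(comb(n, k) * B[n - k] * kappa[k] for k in range(n + 1)))
    return B

print(moments_from_cumulants([0.0, 1.0, 0.0, 0.0, 0.0, 0.0]))
# [1.0, 0.0, 1.0, 0.0, 3.0, 0.0, 15.0]
```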

Normal

A standard normal random variable Z has density function φ(z) = \exp(-z^2/2)/\sqrt{2\pi}, -\infty < z < \infty.

Exercise. Show \int_{-\infty}^\infty \exp(-\pi x^2)\,dx = 1.

Solution Let I = \int_{-\infty}^\infty \exp(-\pi x^2)\,dx. We compute I^2 using polar coordinates x = r\cos\theta, y = r\sin\theta. Since dx = -r\sin\theta\,d\theta + \cos\theta\,dr and dy = r\cos\theta\,d\theta + \sin\theta\,dr we have \begin{aligned} dx\,dy &= (-r\sin\theta\,d\theta + \cos\theta\,dr)(r\cos\theta\,d\theta + \sin\theta\,dr) \\ &= -r\sin\theta\,r\cos\theta\,d\theta\,d\theta - r\sin\theta\sin\theta\,d\theta\,dr + \cos\theta\,r\cos\theta\,dr\,d\theta + \cos\theta\sin\theta\,dr\,dr \\ &= -r\sin^2\theta\,d\theta\,dr + r\cos^2\theta\,dr\,d\theta\\ &= r(\sin^2\theta + \cos^2\theta)\,dr\,d\theta\\ &= r\,dr\,d\theta\\ \end{aligned} using d\theta\,d\theta = 0, dr\,dr = 0, and d\theta\,dr = -dr\,d\theta. Then \begin{aligned} I^2 &= \int_{-\infty}^\infty \exp(-\pi x^2)\,dx \int_{-\infty}^\infty \exp(-\pi y^2)\,dy \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty \exp(-\pi x^2) \exp(-\pi y^2)\,dx\,dy \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty \exp(-\pi (x^2 + y^2)) \,dx\,dy \\ &= \int_{0}^{2\pi} \int_{0}^\infty \exp(-\pi r^2)\, r\,dr\,d\theta \\ &= \int_{0}^{2\pi} \left[-\exp(-\pi r^2)/2\pi\right]_0^\infty\,d\theta \\ &= \int_{0}^{2\pi} 1/2\pi\,d\theta \\ &= 1 \\ \end{aligned}

Exercise. Show \int_{-\infty}^\infty \exp(-\alpha x^2)\,dx = \sqrt{\pi/\alpha}.

Hint: Use the change of variables \pi x^2 = \alpha y^2.

Exercise. Show \int_{-\infty}^\infty φ(x)\,dx = 1.

Hint: Use \alpha = 1/2.

Exercise. Show the moment generating function of a standard normal is {μ(s) = E[\exp(s Z)] = e^{s^2/2}}.

Hint: Complete the square.

Solution \begin{aligned} E[\exp(s Z)] &= \int_{-\infty}^\infty e^{sz} e^{-z^2/2}\,dz/\sqrt{2\pi} \\ &= e^{s^2/2}\int_{-\infty}^\infty e^{-(z-s)^2/2}\,dz/\sqrt{2\pi} \\ &= e^{s^2/2}\int_{-\infty}^\infty e^{-z^2/2}\,dz/\sqrt{2\pi} \\ &= e^{s^2/2} \\ \end{aligned}

Exercise. Show E[e^{sZ} f(Z)] = E[e^{sZ}] E[f(Z + s)].

Hint: Complete the square.

Solution \begin{aligned} E[\exp(s Z) f(Z)] &= \int_{-\infty}^\infty e^{sz} f(z) e^{-z^2/2}\,dz/\sqrt{2\pi} \\ &= e^{s^2/2}\int_{-\infty}^\infty f(z) e^{-(z-s)^2/2}\,dz/\sqrt{2\pi} \\ &= e^{s^2/2}\int_{-\infty}^\infty f(z + s) e^{-z^2/2}\,dz/\sqrt{2\pi} \\ &= E[e^{sZ}] E[f(Z + s)] \\ \end{aligned}

The cumulant of a standard normal random variable is κ(s) = \log μ(s) = s^2/2. This shows Z has mean 0 and variance 1.
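
A Monte Carlo sketch of E[\exp(sZ)] = e^{s^2/2}; the sample size is arbitrary and the agreement is approximate, noisier for larger s.

```python
# Compare the sample average of exp(s Z) with exp(s^2 / 2) for a few values of s.
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed
Z = rng.standard_normal(1_000_000)
for s in (0.5, 1.0, 2.0):
    print(s, np.exp(s * Z).mean(), np.exp(s**2 / 2))
```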

Jointly Normal

If Z = (Z_1,\ldots,Z_n) are independent standard normally distributed random variables then the joint density function is {φ_n(z_1,\ldots,z_n) = \prod_{j=1}^n φ(z_j)}.

Exercise. If s\in\mathbf{R}^n show E[\exp(s\cdot Z) f(Z)] = E[\exp(s\cdot Z)] E[f(Z + s)].

Hint: Complete the square using |z - s|^2 = |z|^2 - 2z\cdot s + |s|^2.

We say N = (N_1,\ldots,N_n) are jointly normal if a\cdot N is normal for every a\in\mathbf{R}^n. Let μ = E[N] and Σ = \operatorname{Var}(N) = E[NN'] - E[N]E[N'].

Exercise. Show the components of Z = Σ^{-1/2}(N - μ) are independent standard normal random variables.

Hint: By joint normality we know Z are normal. Show E[Z] = 0 and \operatorname{Var}(Z) = I, the n\times n identity matrix.
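
The following Python sketch illustrates this with a made-up mean and covariance: it builds jointly normal samples N = μ + AZ with AA' = Σ (Cholesky is one choice of square root) and checks that Σ^{-1/2}(N - μ) has approximately zero mean and identity covariance.

```python
# Standardizing jointly normal samples: Sigma^{-1/2}(N - mu) is standard normal.
import numpy as np

rng = np.random.default_rng(1)               # arbitrary seed
mu = np.array([1.0, -2.0])                   # hypothetical mean
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # hypothetical covariance
A = np.linalg.cholesky(Sigma)                # A @ A.T = Sigma

Z = rng.standard_normal((100_000, 2))
N = mu + Z @ A.T                             # rows are samples of N

w, V = np.linalg.eigh(Sigma)                 # symmetric inverse square root of Sigma
Sigma_inv_half = V @ np.diag(w**-0.5) @ V.T
Zhat = (N - mu) @ Sigma_inv_half.T

print(Zhat.mean(axis=0))                     # approximately [0, 0]
print(np.cov(Zhat.T))                        # approximately the identity matrix
```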

Concentration Inequalities

A common problem is determining when a sequence of random variables converges. Concentration inequalities can be used for that.

Lemma. (Markov) If f is non-negative then P(f(X) > \lambda) \le E[f(X)]/\lambda for \lambda > 0.

Proof. We have E[f(X)] \ge E[f(X)1(f(X) > \lambda)] \ge \lambda P(f(X) > \lambda).

Note this only has import for large \lambda.

Exercise. For any non-negative random variable X and any increasing function \phi, P(X > \lambda) \le E[\phi(X)]/\phi(\lambda).

An immediate corollary is P(|X| > \lambda) \le E[|X|]/\lambda.

Exercise. (Chebyshev) Show P(|X - E[X]| > \lambda) \le \operatorname{Var}(X)/\lambda^2.

Hint: |X - E[X]| > \lambda if and only if |X - E[X]|^2 > \lambda^2.
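
A quick numerical check of both inequalities for an exponential random variable with E[X] = \operatorname{Var}(X) = 1; the distribution and sample size are arbitrary choices.

```python
# Markov bound P(X > lambda) <= E[X]/lambda and
# Chebyshev bound P(|X - E[X]| > lambda) <= Var(X)/lambda^2.
import numpy as np

rng = np.random.default_rng(3)               # arbitrary seed
X = rng.exponential(1.0, 1_000_000)          # E[X] = 1, Var(X) = 1
for lam in (2.0, 3.0, 5.0):
    print(lam,
          (X > lam).mean(), 1.0 / lam,                       # tail vs E[X]/lambda
          (np.abs(X - 1.0) > lam).mean(), 1.0 / lam**2)      # vs Var(X)/lambda^2
```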

Law of Large Numbers

A statistic is a function of random variables. If X is a random variable and (X_j) are independent and have the same law as X let S_n = (X_1 + \cdots + X_n)/n.

Exercise. Show E[S_n] = E[X].

Exercise. Show \operatorname{Var}(S_n) = \operatorname{Var}(X)/n.
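
A simulation sketch using uniform random variables on [0,1), so E[X] = 1/2 and \operatorname{Var}(X) = 1/12: the sample mean S_n concentrates around 1/2 and its variance is close to \operatorname{Var}(X)/n.

```python
# Law of large numbers: the variance of S_n shrinks like Var(X)/n.
import numpy as np

rng = np.random.default_rng(4)               # arbitrary seed
for n in (10, 100, 1000):
    S = rng.random((10_000, n)).mean(axis=1) # 10,000 independent copies of S_n
    print(n, S.mean(), S.var(), 1.0 / (12 * n))
```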

Jensen’s Inequality

A function \phi\colon\mathbf{R}\to\mathbf{R} is convex if \phi(x) = \sup_{\lambda\le\phi} \lambda(x) and concave if \phi(x) = \inf_{\lambda\ge\phi} \lambda(x), where \lambda ranges over affine functions \lambda(x) = ax + b.

Exercise. If \phi is convex then \phi(tx + (1 - t)x') \le t\phi(x) + (1 - t)\phi(x') if 0\le t\le 1.

Hint: For any affine \lambda\le\phi, \lambda(tx + (1 - t)x') = t\lambda(x) + (1 - t)\lambda(x') \le t\phi(x) + (1 - t)\phi(x'). Take the supremum over \lambda\le\phi.

Theorem. If \phi is convex then E[\phi(X)] \ge \phi(E[X]).

For affine \lambda\le\phi we have E[\phi(X)] \ge E[\lambda(X)] = \lambda(E[X]), so E[\phi(X)] \ge \sup_{\lambda\le\phi}\lambda(E[X]) = \phi(E[X]).

Convergence

Random variables X_n converge to X in mean if E[|X_n - X|] converges to 0. They converge in mean square if E[(X_n - X)^2] converges to 0. They converge in probability if for all \epsilon > 0, P(|X_n - X| > \epsilon) converges to 0. They converge almost surely if P(\lim_n X_n = X) = 1.

Exercise. If X_n converges in mean square then it converges in probability.

Hint: \phi(x) = x^2 is increasing for x > 0.