🎙️ Advanced Probability Seminar: From the Law of Large Numbers to the Large Deviation Principle
"The Strong Law of Large Numbers tells us where we are going; the Large Deviation Principle tells us the cost of getting lost."
Talk Abstract
- Theme: Limit behaviors of random sequences: From SLLN to Cramér's Large Deviation Theorem
- Duration: 90 Minutes (Full Session)
- Logical Structure:
- Convergence Foundations: Review of tail \(\sigma\)-algebras, 0-1 laws, and criteria for series convergence (Kolmogorov's Three-Series Theorem).
- Convergence Rates: Using Kronecker's Lemma to explore the convergence order of SLLN (Marcinkiewicz-Zygmund Theorem).
- Large Deviation Principle (LDP) Core: Studying exponentially small probabilities of deviating from the mean, deriving Cramér's Theorem and the rate function \(I(a)\).
- Core Difficulties: Intuitive understanding of the Legendre Transform, and the application of Change of Measure (Cramér Transform) in proving the lower bound.
1. Tail \(\sigma\)-field and 0-1 Laws
When studying the asymptotic behavior of a sequence of random variables, we often only care about those events that are "unaffected by any finite number of random variables." The collection of such events forms an extremely important algebraic structure—the tail \(\sigma\)-field.
Definition 1.1 (Tail \(\sigma\)-field)
Let \(\{X_n\}_{n \ge 1}\) be a sequence of random variables. Define \(\mathcal{F}_n' = \sigma(X_n, X_{n+1}, \dots)\) as the \(\sigma\)-algebra generated starting from the \(n\)-th variable. Define the tail \(\sigma\)-field \(\mathcal{T}\) as:
\[\mathcal{T} = \bigcap_{n=1}^{\infty} \mathcal{F}_n' = \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \dots).\]
If an event \(A \in \mathcal{T}\), then for every \(n\), whether \(A\) occurs is determined by \(\{X_n, X_{n+1}, \dots\}\) alone and does not depend on the values of the first \(n-1\) variables. We call events in \(\mathcal{T}\) tail events.
Classical Corollary: Borel-Cantelli Lemma I
If \(\sum_{n=1}^\infty P(A_n) < \infty\), then \(P(A_n \text{ i.o.}) = 0\). (Here \(\text{i.o.}\) stands for infinitely often, i.e., occurring infinitely many times). The limit superior \(\limsup_n A_n = \bigcap_{n=1}^\infty \bigcup_{m=n}^\infty A_m = \{A_n \text{ i.o.}\}\) is also a typical tail event.
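As a quick numerical illustration (a sketch added here, not from the original notes): take independent events \(A_n = \{U_n < n^{-2}\}\) with i.i.d. uniform \(U_n\). Then \(\sum_n P(A_n) = \sum_n n^{-2} < \infty\), so Borel-Cantelli I predicts that only finitely many \(A_n\) occur along almost every sample path.

```python
import random

random.seed(0)

def count_occurrences(n_max):
    """Count how many events A_n = {U_n < 1/n^2} occur among n = 1..n_max."""
    return sum(1 for n in range(1, n_max + 1) if random.random() < 1.0 / n**2)

# Across independent sample paths, the count stays small no matter how large
# n_max is -- consistent with P(A_n i.o.) = 0.
counts = [count_occurrences(100_000) for _ in range(20)]
print(counts)
```

The expected number of occurrences is \(\sum_{n \ge 1} n^{-2} = \pi^2/6 \approx 1.64\), so the simulated counts cluster around one or two regardless of the horizon.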
1.1 Kolmogorov's 0-1 Law
Theorem 1.1 (Kolmogorov's 0-1 Law)
If the sequence \(X_1, X_2, \dots\) consists of mutually independent random variables, and \(A \in \mathcal{T}\) is a tail event, then:
\[P(A) = 0 \quad \text{or} \quad P(A) = 1.\]
Proof of Kolmogorov's 0-1 Law (Click to expand)
Core idea: Prove that event \(A\) is independent of itself.
Let \(\mathcal{F}_n = \sigma(X_1, \dots, X_n)\) and \(\mathcal{F}_{n+1}' = \sigma(X_{n+1}, X_{n+2}, \dots)\). Since the sequence \(\{X_n\}\) is mutually independent, the \(\sigma\)-algebras \(\mathcal{F}_n\) and \(\mathcal{F}_{n+1}'\) generated by disjoint sets of variables are also mutually independent.
For any tail event \(A \in \mathcal{T}\), since \(\mathcal{T} \subset \mathcal{F}_{n+1}'\), event \(A\) must belong to \(\mathcal{F}_{n+1}'\). Therefore, \(A\) is independent of \(\mathcal{F}_n\). Since this conclusion holds for all \(n\), we deduce that \(A\) is independent of the union of the algebras generated by the first finite number of variables \(\bigcup_{n=1}^\infty \mathcal{F}_n\).
According to the \(\pi-\lambda\) theorem (or monotone class theorem) in measure theory, since \(A\) is independent of the union algebra, \(A\) must be independent of the \(\sigma\)-algebra generated by it:
\[\mathcal{F}_\infty := \sigma\Big(\bigcup_{n=1}^{\infty} \mathcal{F}_n\Big) = \sigma(X_1, X_2, \dots).\]
However, the tail \(\sigma\)-field \(\mathcal{T}\) itself is a sub-\(\sigma\)-algebra of \(\mathcal{F}_\infty\), so \(A \in \mathcal{F}_\infty\). In summary, \(A\) is independent of \(\mathcal{F}_\infty\), and simultaneously belongs to \(\mathcal{F}_\infty\). This means \(A\) must be independent of itself:
\[P(A) = P(A \cap A) = P(A)\,P(A) = [P(A)]^2.\]
Solving this equation yields only \(P(A) = 0\) or \(P(A) = 1\). \(\square\)
1.2 Permutable Events and the Hewitt-Savage 0-1 Law
Besides tail events that discard the first finite terms, there is another class of events insensitive to the permutation order of a finite number of elements.
Definition 1.2 (Finite Permutation and Permutable Event)
- Finite Permutation: A mapping \(\pi: \mathbb{N} \rightarrow \mathbb{N}\) is called a finite permutation if it is a bijection and there are only finitely many \(i\) satisfying \(\pi(i) \ne i\).
- Permutable Event: View a sample point as the sequence \(\omega = (\omega_1, \omega_2, \dots)\) and let \(\pi\omega := (\omega_{\pi(1)}, \omega_{\pi(2)}, \dots)\). If for every finite permutation \(\pi\) the preimage \(\pi^{-1}(A) := \{\omega : \pi\omega \in A\}\) equals \(A\), then \(A\) is called a permutable event.
All permutable events constitute the Exchangeable \(\sigma\)-field, denoted as \(\mathcal{E}\). Clearly, \(\mathcal{T} \subset \mathcal{E}\).
Theorem 1.2 (Hewitt-Savage 0-1 Law)
If \(X_1, X_2, \dots\) are independent and identically distributed (i.i.d.), and \(A \in \mathcal{E}\) is a permutable event, then:
\[P(A) = 0 \quad \text{or} \quad P(A) = 1.\]
Proof of the Hewitt-Savage 0-1 Law (Click to expand)
The basic idea is the same as Kolmogorov's 0-1 Law: Prove that \(P(A) = [P(A)]^2\).
For a permutable event \(A \in \mathcal{E} \subset \sigma(X_1, X_2, \dots)\), the measure approximation theorem provides a sequence of "cylinder set" events \(A_n \in \sigma(X_1, \dots, X_n)\), each depending only on the first \(n\) variables, whose symmetric difference with \(A\) becomes negligible:
\[P(A \,\triangle\, A_n) \rightarrow 0.\]
This also implies that \(P(A_n) \rightarrow P(A)\).
Now, construct a specific finite permutation \(\pi_n\) that swaps the first \(n\) coordinates with the next \(n\) coordinates: \(\pi_n(1, \dots, n, n+1, \dots, 2n) = (n+1, \dots, 2n, 1, \dots, n)\).
Let \(A_n' = \pi_n(A_n)\). Since \(A_n\) only depends on \(X_1, \dots, X_n\), then \(A_n'\) will only depend on \(X_{n+1}, \dots, X_{2n}\). Because the sequence is i.i.d., \(A_n\) and \(A_n'\) are mutually independent and identically distributed. Therefore:
\[P(A_n \cap A_n') = P(A_n)\,P(A_n') = [P(A_n)]^2.\]
On the other hand, since \(A\) is a permutable event, \(\pi_n^{-1}(A) = A\), and because the sequence is i.i.d., applying \(\pi_n\) does not change probabilities:
\[P(A \,\triangle\, A_n') = P\big(\pi_n^{-1}(A \,\triangle\, A_n')\big) = P(A \,\triangle\, A_n) \rightarrow 0.\]
Since \(P(A \,\triangle\, A_n) \rightarrow 0\) and \(P(A \,\triangle\, A_n') \rightarrow 0\), both \(A_n\) and \(A_n'\) approximate the same event \(A\), so their intersection does too: \(P(A_n \cap A_n') \rightarrow P(A)\).
Combining the above two equations, we obtain:
\[P(A) = \lim_{n \rightarrow \infty} P(A_n \cap A_n') = \lim_{n \rightarrow \infty} [P(A_n)]^2 = [P(A)]^2.\]
The proof is complete. \(\square\)
2. Convergence Theorems for Random Series
To study the convergence of \(\sum X_n\), we need a powerful inequality tool to control the maximum value of local fluctuations.
Lemma 2.1 (Kolmogorov's Maximal Inequality)
Assume \(X_1, \dots, X_n\) are mutually independent, with mean 0, and finite variances. Denote the partial sum as \(S_k = \sum_{i=1}^k X_i\). For any \(x > 0\):
\[P\Big(\max_{1 \le k \le n} |S_k| \ge x\Big) \le \frac{Var(S_n)}{x^2} = \frac{1}{x^2} \sum_{i=1}^{n} Var(X_i).\]
(Note: Compared to Chebyshev's inequality \(P(|S_n| \ge x) \le x^{-2} Var(S_n)\), the maximal inequality provides a stronger uniform bound.)
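The bound can be sanity-checked by simulation (an illustrative sketch, not part of the original notes): for \(\pm 1\) steps, \(Var(S_n) = n\), and the empirical probability of the maximum excursion should sit below \(Var(S_n)/x^2\).

```python
import random

random.seed(1)
n, x, trials = 50, 10.0, 20_000

exceed = 0
for _ in range(trials):
    s, max_abs = 0.0, 0.0
    for _ in range(n):
        s += random.choice((-1.0, 1.0))  # independent steps, mean 0, Var = 1
        max_abs = max(max_abs, abs(s))
    if max_abs >= x:
        exceed += 1

empirical = exceed / trials  # estimate of P(max_k |S_k| >= x)
bound = n / x**2             # Kolmogorov bound Var(S_n)/x^2 = 50/100 = 0.5
print(empirical, bound)
```

The empirical value lands well below the bound 0.5; note that Chebyshev applied only to \(S_n\) controls a smaller event, which is why the uniform maximal bound is the stronger statement.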
2.1 Kolmogorov's Convergence Theorem
Theorem 2.1 (Kolmogorov's Convergence Theorem)
Assume \(\{X_n\}\) is a sequence of mutually independent random variables, and \(E(X_n) = 0\). If the series of variances converges:
\[\sum_{n=1}^{\infty} Var(X_n) < \infty,\]
Then the random series \(\sum_{n=1}^\infty X_n\) converges almost surely (a.s.).
Proof of Kolmogorov's Convergence Theorem (Click to expand)
Let the partial sum be \(S_N = \sum_{n=1}^N X_n\). We need to prove that the sequence \(\{S_N\}\) is a Cauchy sequence in \(\mathbb{R}\) a.s.
Apply Kolmogorov's maximal inequality to examine the fluctuation on the interval \((M, N]\):
\[P\Big(\max_{M < k \le N} |S_k - S_M| \ge \epsilon\Big) \le \frac{1}{\epsilon^2} \sum_{n=M+1}^{N} Var(X_n).\]
Since \(\sum Var(X_n) < \infty\), the tail of the variance series tends to 0 as \(M \rightarrow \infty\). Letting \(N \rightarrow \infty\), by continuity of the probability measure:
\[P\Big(\sup_{k > M} |S_k - S_M| \ge \epsilon\Big) \le \frac{1}{\epsilon^2} \sum_{n=M+1}^{\infty} Var(X_n) \xrightarrow{M \rightarrow \infty} 0.\]
This means that for any \(\epsilon > 0\), the probability of the maximum tail fluctuation tends to 0. This is equivalent to saying the probability that \(\{S_N\}\) is a Cauchy sequence is 1, hence the series converges almost surely. \(\square\)
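A concrete instance of Theorem 2.1 (a sketch, not in the original): the random-sign harmonic series \(\sum_n \epsilon_n/n\) with i.i.d. signs \(\epsilon_n = \pm 1\) has \(\sum_n Var(\epsilon_n/n) = \sum_n n^{-2} < \infty\), so its partial sums are almost surely Cauchy.

```python
import random

random.seed(2)

def partial_sums(n_terms):
    """Partial sums of sum_n eps_n / n with random signs eps_n = +-1."""
    s, out = 0.0, []
    for n in range(1, n_terms + 1):
        s += random.choice((-1.0, 1.0)) / n
        out.append(s)
    return out

path = partial_sums(200_000)
# The tail of the path barely fluctuates: Cauchy behaviour, hence convergence.
late_fluctuation = max(path[100_000:]) - min(path[100_000:])
print(late_fluctuation)
```

Contrast this with the deterministic harmonic series \(\sum_n 1/n\), which diverges: the random signs produce enough cancellation to make the series converge almost surely.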
2.2 Kolmogorov's Three-Series Theorem
Not all random variables have finite variances or expectations; in such cases, we need to apply the Truncation Method.
Theorem 2.2 (Kolmogorov's Three-Series Theorem)
Let \(X_1, X_2, \dots\) be mutually independent random variables, and take any constant \(A > 0\). Define the truncated variables:
\[Y_n = X_n \mathbb{I}_{\{|X_n| \le A\}}.\]
Then a necessary and sufficient condition for the random series \(\sum_{n=1}^\infty X_n\) to converge almost surely is that the following three series converge simultaneously:
(i) \(\sum_{n=1}^\infty P(|X_n| > A) < \infty\)
(ii) \(\sum_{n=1}^\infty E(Y_n)\) converges
(iii) \(\sum_{n=1}^\infty Var(Y_n) < \infty\)
Proof of the Three-Series Theorem (Sufficiency) (Click to expand)
Given that conditions (i), (ii), and (iii) converge, we want to prove that the original series converges.
According to condition (iii), the variance series of the truncated variables converges: \(\sum Var(Y_n) < \infty\). Clearly, \(Y_n - E(Y_n)\) are mutually independent random variables with a mean of 0. According to Kolmogorov's Convergence Theorem (Theorem 2.1), the series \(\sum_{n=1}^\infty (Y_n - E(Y_n))\) converges almost surely.
Combined with condition (ii), the deterministic series \(\sum_{n=1}^\infty E(Y_n)\) is convergent. Adding these two together, we obtain:
\[\sum_{n=1}^{\infty} Y_n = \sum_{n=1}^{\infty} \big(Y_n - E(Y_n)\big) + \sum_{n=1}^{\infty} E(Y_n) \quad \text{converges a.s.}\]
Next, we need to transition the convergence of \(Y_n\) back to \(X_n\). According to condition (i), \(\sum_{n=1}^\infty P(|X_n| > A) < \infty\). Since \(P(|X_n| > A) = P(X_n \ne Y_n)\), according to the Borel-Cantelli Lemma I:
\[P(X_n \ne Y_n \text{ i.o.}) = 0.\]
This means that, in a sample space with a probability of 1, there are at most finitely many \(n\) satisfying \(X_n \ne Y_n\). In other words, for sufficiently large \(n\), \(X_n\) and \(Y_n\) are eventually completely identical. Therefore, the convergence behavior of the series \(\sum X_n\) and \(\sum Y_n\) must be exactly the same. Since \(\sum Y_n\) converges a.s., then \(\sum X_n\) must also converge a.s. The sufficiency is proven. \(\square\)
3. Strong Law of Large Numbers (SLLN)
Before discussing the law of large numbers, we need a fundamental lemma from mathematical analysis, which can convert "series convergence" into "weighted average convergence".
Lemma 3.1 (Kronecker's Lemma)
Assume \(\{a_n\}\) is a nondecreasing sequence of positive numbers tending to infinity, i.e., \(0 < a_1 \le a_2 \le \dots \uparrow \infty\). If the real series \(\sum_{n=1}^\infty \frac{x_n}{a_n}\) converges, then necessarily:
\[\lim_{n \rightarrow \infty} \frac{1}{a_n} \sum_{k=1}^{n} x_k = 0.\]
3.1 SLLN under Finite Variance
Theorem 3.1 (SLLN under Finite Variance)
Let \(X_1, X_2, \dots\) be i.i.d. random variables, and \(E(X_1) = \mu, E(X_1^2) < \infty\). Denote the partial sum \(S_n = \sum_{i=1}^n X_i\), then as \(n \rightarrow \infty\):
\[\frac{S_n}{n} \xrightarrow{a.s.} \mu.\]
Truncation Proof Method for the Strong Law of Large Numbers (Click to expand)
Step 1: Construct truncated variables and verify the variance series. Define the dynamically truncated variables varying with \(k\):
\[Y_k = X_k \mathbb{I}_{\{|X_k| \le k\}}.\]
Define the centered variables \(Z_k = Y_k - E(Y_k)\). Clearly \(E(Z_k) = 0\) and they are mutually independent. Examine the variance series:
\[\sum_{k=1}^{\infty} \frac{Var(Z_k)}{k^2} \le \sum_{k=1}^{\infty} \frac{E(Y_k^2)}{k^2} = \sum_{k=1}^{\infty} \frac{1}{k^2} E\big[X_k^2 \mathbb{I}_{\{|X_k| \le k\}}\big].\]
Since \(\{X_k\}\) are identically distributed, the above equation equals:
\[\sum_{k=1}^{\infty} \frac{1}{k^2} E\big[X_1^2 \mathbb{I}_{\{|X_1| \le k\}}\big] \le E(X_1^2) \sum_{k=1}^{\infty} \frac{1}{k^2} < \infty.\]
Step 2: Apply series convergence and Kronecker's Lemma. According to Kolmogorov's Convergence Theorem, the series \(\sum_{k=1}^\infty \frac{Z_k}{k}\) converges a.s. Apply Kronecker's Lemma (taking \(a_n = n\)):
\[\frac{1}{n} \sum_{k=1}^{n} Z_k \xrightarrow{a.s.} 0.\]
Step 3: Expectation asymptotics and B-C Lemma transition. By the Dominated Convergence Theorem (DCT), \(E(Y_k) \rightarrow E(X_1) = \mu\). By the property of Cesàro means, \(\frac{1}{n} \sum_{k=1}^n E(Y_k) \rightarrow \mu\). Therefore:
\[\frac{1}{n} \sum_{k=1}^{n} Y_k = \frac{1}{n} \sum_{k=1}^{n} Z_k + \frac{1}{n} \sum_{k=1}^{n} E(Y_k) \xrightarrow{a.s.} 0 + \mu = \mu.\]
Finally, by Chebyshev's inequality and the B-C Lemma:
\[\sum_{k=1}^{\infty} P(X_k \ne Y_k) = \sum_{k=1}^{\infty} P(|X_1| > k) \le \sum_{k=1}^{\infty} \frac{E(X_1^2)}{k^2} < \infty.\]
Thus \(P(X_k \ne Y_k \text{ i.o.}) = 0\). This means that, in a limit sense, replacing \(X_k\) with \(Y_k\) does not change the convergence of the average, therefore \(\frac{S_n}{n} \xrightarrow{a.s.} \mu\). \(\square\)
(Note: Kolmogorov later removed the second-moment assumption, proving that the first moment \(E|X_1| < \infty\) alone guarantees the strong law of large numbers; this involves finer integration bounding controls.)
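A minimal simulation of Theorem 3.1 (an added sketch): for i.i.d. Exponential(1) variables (\(\mu = 1\), finite variance), the sample mean locks onto \(\mu\) as \(n\) grows.

```python
import random

random.seed(3)

def sample_mean(n):
    """Mean of n i.i.d. Exponential(1) draws; the SLLN predicts convergence to 1."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

errors = [abs(sample_mean(n) - 1.0) for n in (100, 10_000, 1_000_000)]
print(errors)  # typically shrinking toward 0 at roughly the 1/sqrt(n) scale
```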
4. Generalized Convergence Rates and Infinite Expectations
When studying convergence beyond the \(1/n\) scaling factor, we often need to adjust the order conditions for the existence of moments.
4.1 Convergence Rate with Finite Variance
Theorem 4.1
Let \(X_1, X_2, \dots\) be i.i.d. with \(E(X_i) = 0, E(X_i^2) = \sigma^2 < \infty\). For any given \(\delta > 0\), we have:
\[\frac{S_n}{n^{1/2} (\log n)^{1/2 + \delta}} \xrightarrow{a.s.} 0.\]
Brief Proof (Click to expand)
Let the normalization coefficient be \(a_n = n^{1/2}(\log n)^{1/2 + \delta}\). Consider the series \(\sum_{n=1}^\infty \frac{Var(X_n)}{a_n^2} = \sigma^2 \sum_{n=1}^\infty \frac{1}{n (\log n)^{1+2\delta}}\). Since the integral \(\int \frac{1}{x (\log x)^{1+p}} dx < \infty\) (\(p>0\)), the series converges. According to Theorem 2.1, \(\sum_{n=1}^\infty \frac{X_n}{a_n}\) converges almost surely. Applying Kronecker's Lemma directly yields the conclusion. \(\square\)
4.2 Marcinkiewicz-Zygmund Strong Law of Large Numbers
If the second moment of a random variable does not exist, but there is a \(p\)-th order moment between the first and second order, what kind of convergence rate will we have?
Theorem 4.2
Let \(X_1, X_2, \dots\) be i.i.d. with \(E(X_i) = 0, E|X_i|^p < \infty\) where \(1 < p < 2\). Then:
\[\frac{S_n}{n^{1/p}} \xrightarrow{a.s.} 0.\]
Proof: Refined Truncation Method (Click to expand)
Define the truncated variable \(Y_m = X_m \mathbb{I}_{\{|X_m| \le m^{1/p}\}}\).
(1) Legitimacy of the transition:
\[\sum_{m=1}^{\infty} P(X_m \ne Y_m) = \sum_{m=1}^{\infty} P\big(|X_1|^p > m\big) \le E|X_1|^p < \infty.\]
By B-C Lemma I, \(X_m = Y_m\) for all but finitely many \(m\) a.s., so it suffices to prove that \(\frac{1}{n^{1/p}} \sum_{m=1}^{n} Y_m \xrightarrow{a.s.} 0\).
(2) Convergence of the truncated variance series: Center the variables \(Z_m = Y_m - E(Y_m)\). We want to prove \(\sum_{m=1}^\infty \frac{E(Y_m^2)}{m^{2/p}} < \infty\).
Interchange the order of summation and integration (Fubini's Theorem):
\[\sum_{m=1}^{\infty} \frac{E(Y_m^2)}{m^{2/p}} = \sum_{m=1}^{\infty} \frac{1}{m^{2/p}} E\big[X_1^2 \mathbb{I}_{\{|X_1|^p \le m\}}\big] = E\Big[X_1^2 \sum_{m \ge |X_1|^p} \frac{1}{m^{2/p}}\Big].\]
Note the tail sum bound \(\sum_{m \ge k} m^{-2/p} \le C \cdot k^{1 - 2/p}\). Substituting this yields:
\[\sum_{m=1}^{\infty} \frac{E(Y_m^2)}{m^{2/p}} \le C \cdot E\big[X_1^2 \cdot |X_1|^{p(1 - 2/p)}\big] = C \cdot E|X_1|^p < \infty.\]
(3) Applying Kronecker's Lemma: Since \(\sum Var(Z_m) / (m^{1/p})^2 < \infty\), the series \(\sum \frac{Z_m}{m^{1/p}}\) converges a.s. By Kronecker's Lemma, \(\frac{1}{n^{1/p}} \sum_{m=1}^n Z_m \rightarrow 0\) a.s.
(4) Handling the expectation drift: Since \(E(X_m) = 0\), we have \(E(Y_m) = -E[X_1 \mathbb{I}_{\{|X_1| > m^{1/p}\}}]\). Through similar integration bounds, it can be shown that \(\frac{1}{n^{1/p}} \sum_{m=1}^n E(Y_m) \rightarrow 0\). Writing \(T_n = \sum_{m=1}^n Y_m\), combining this with (3) yields \(\frac{T_n}{n^{1/p}} \rightarrow 0\) a.s., and since \(X_m = Y_m\) eventually, we conclude \(\frac{S_n}{n^{1/p}} \xrightarrow{a.s.} 0\). \(\square\)
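An illustrative heavy-tailed check of the theorem (a sketch with an assumed Pareto-type law, not from the original): symmetric variables with tail \(P(|X| > x) = x^{-3/2}\) for \(x \ge 1\) have \(E|X|^p < \infty\) for every \(p < 3/2\) but infinite variance; taking \(p = 1.4\), the normalized sums \(|S_n|/n^{1/p}\) should be stochastically small.

```python
import random

random.seed(4)
alpha, p = 1.5, 1.4  # tail index alpha; moment order p < alpha

def heavy_sample():
    """Symmetric Pareto-type draw: P(|X| > x) = x**(-alpha), random sign, mean 0."""
    return random.choice((-1.0, 1.0)) * random.random() ** (-1.0 / alpha)

def normalized_sum(n):
    return abs(sum(heavy_sample() for _ in range(n))) / n ** (1.0 / p)

ratios = sorted(normalized_sum(100_000) for _ in range(9))
print(ratios[4])  # the median is O(1) here; the M-Z law says it tends to 0 a.s.
```

Individual runs can still spike (the summands are heavy-tailed), which is why the median over several independent blocks is the honest summary here.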
4.3 Infinite Mean Divergence
If even the first-order expectation does not exist (i.e., the expectation is infinite), can the Law of Large Numbers still provide a bound?
Theorem 4.3 (Infinite Mean Divergence)
Let \(X_1, X_2, \dots\) be i.i.d. with \(E|X_i| = \infty\). Let \(\{a_n\}\) be any sequence of positive numbers such that \(a_n / n\) is increasing. Then the limit superior must be extreme:
\[\limsup_{n \rightarrow \infty} \frac{|S_n|}{a_n} = 0 \quad \text{a.s.} \qquad \text{or} \qquad \limsup_{n \rightarrow \infty} \frac{|S_n|}{a_n} = \infty \quad \text{a.s.}\]
Proof: Using Borel-Cantelli Lemma II (Click to expand)
Fix any constant \(k > 0\), and consider the divergent case (for instance \(a_n = n\): there \(E|X_1| = \infty\) gives \(\sum_n P(|X_1| \ge kn) \ge \frac{1}{k} E|X_1| - 1 = \infty\)). Then the tail series must diverge:
\[\sum_{n=1}^{\infty} P\big(|X_n| \ge k a_n\big) = \infty.\]
Since the sequence is mutually independent, according to Borel-Cantelli Lemma II, the event \(\{|X_n| \ge k a_n\}\) will occur infinitely often (i.o.). This means:
\[P\Big(\frac{|X_n|}{a_n} \ge k \ \text{i.o.}\Big) = 1 \quad \text{for every } k > 0, \qquad \text{i.e.,} \quad \limsup_{n \rightarrow \infty} \frac{|X_n|}{a_n} = \infty \ \text{a.s.}\]
Now consider the relationship between the partial sum and a single term: \(X_n = S_n - S_{n-1}\). By the triangle inequality (and \(a_{n-1} \le a_n\)):
\[\frac{|X_n|}{a_n} \le \frac{|S_n|}{a_n} + \frac{|S_{n-1}|}{a_n} \le \frac{|S_n|}{a_n} + \frac{|S_{n-1}|}{a_{n-1}}.\]
If \(\limsup \frac{|S_n|}{a_n} = M < \infty\), then the right side of the inequality would be bounded by \(2M\), which contradicts the fact that the limit superior on the left side is \(\infty\)! Therefore, we can only conclude:
\[\limsup_{n \rightarrow \infty} \frac{|S_n|}{a_n} = \infty \quad \text{a.s.}\]
(If \(a_n\) grows too fast, such as exponential growth causing the series to converge, then the limit might collapse to \(0\). So the result can only be one of these two extreme situations). \(\square\)
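For contrast, a sketch (not in the original) with standard Cauchy variables, where \(E|X_1| = \infty\): here \(S_n/n\) is itself standard Cauchy for every \(n\), so \(|S_n|/n\) keeps producing large values instead of converging.

```python
import math
import random

random.seed(5)

def cauchy():
    """Standard Cauchy draw via the inverse-CDF transform."""
    return math.tan(math.pi * (random.random() - 0.5))

# |S_n| / n over many independent blocks: the values never settle near a constant.
peaks = []
for _ in range(200):
    n = 2_000
    peaks.append(abs(sum(cauchy() for _ in range(n))) / n)
print(min(peaks), max(peaks))
```

The spread between the smallest and largest block values stays enormous no matter how large \(n\) is taken, in sharp contrast to the SLLN picture of Section 3.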
5. Large Deviation Principle (LDP)
The Strong Law of Large Numbers tells us that \(\frac{S_n}{n}\) will converge to \(\mu\) with probability 1. But in statistical physics or actuarial science, we are more concerned with: When \(a > \mu\), how fast does \(P(\frac{S_n}{n} \ge a)\) tend to 0?
5.1 Moment Generating Function and Rate Function
Before studying exponential convergence, we need to characterize the "tail properties" of the distribution.
Definition 5.1 (Moment Generating Function and Log-Moment Generating Function)
Let \(X, X_1, X_2, \dots\) be i.i.d. random variables:
- Moment Generating Function (MGF): \(\phi(\theta) = E[e^{\theta X}]\)
- Log-Moment Generating Function: \(\Lambda(\theta) = \log \phi(\theta)\)
Definition 5.2 (Rate Function / Legendre Transform)
Define the Legendre Transform \(I(a)\) of \(\Lambda(\theta)\) as:
\[I(a) = \sup_{\theta \in \mathbb{R}} \big\{\theta a - \Lambda(\theta)\big\}.\]
In convex analysis, \(I(a)\) is also called the convex conjugate function. It has lower semi-continuity and convexity, and it achieves its minimum value of \(0\) at \(a = E[X]\).
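A small numerical sketch of the transform (an added illustration): for \(X \sim N(0, 1)\), \(\Lambda(\theta) = \theta^2/2\), and maximizing \(\theta a - \Lambda(\theta)\) over a grid recovers \(I(a) = a^2/2\), with minimum 0 at \(a = E[X] = 0\).

```python
def legendre(a, lam, lo=-10.0, hi=10.0, steps=2000):
    """Grid approximation of I(a) = sup_theta (theta*a - Lambda(theta))."""
    sup = float("-inf")
    for i in range(steps + 1):
        theta = lo + (hi - lo) * i / steps
        sup = max(sup, theta * a - lam(theta))
    return sup

def normal_lambda(t):
    """Log-MGF of N(0,1): Lambda(theta) = theta^2 / 2."""
    return t * t / 2.0

for a in (0.0, 0.5, 1.0, 2.0):
    print(a, legendre(a, normal_lambda), a * a / 2.0)  # grid sup vs closed form
```

The maximizing \(\theta\) is \(\theta^* = a\), illustrating the general first-order condition \(\Lambda'(\theta^*) = a\) that reappears in the lower-bound proof below.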
5.2 Cramér's Theorem
This is the foundational theorem of large deviation theory, revealing that the probability of the sample mean deviating from the expectation decays exponentially.
Theorem 5.1 (Cramér's Theorem in \(\mathbb{R}\))
Let \(\{X_n\}\) be an i.i.d. sequence satisfying \(\phi(\theta) < \infty\) for all \(\theta \in \mathbb{R}\). Denote \(S_n = \sum_{i=1}^n X_i\). Then for any \(a > E[X_1]\), we have:
\[\lim_{n \rightarrow \infty} \frac{1}{n} \log P(S_n \ge na) = -I(a).\]
This implies that when \(n\) is large: \(P(S_n \ge na) \approx e^{-n I(a)}\).
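The exponential decay can be seen directly by simulation (a sketch added here): for fair coins (\(p = 1/2\)) and \(a = 0.6\), the rate function is \(I(0.6) \approx 0.0201\), and the empirical value of \(-\frac{1}{n}\log P(S_n \ge na)\) approaches it from above (a polynomial prefactor makes the convergence slow).

```python
import math
import random

random.seed(6)
p, a = 0.5, 0.6
# Bernoulli rate function: the KL divergence between a and p.
I = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def empirical_rate(n, trials=40_000):
    """Monte Carlo estimate of -(1/n) log P(S_n >= n*a) for Bernoulli(p) sums."""
    hits = sum(1 for _ in range(trials)
               if sum(random.random() < p for _ in range(n)) >= n * a)
    return -math.log(hits / trials) / n

er = empirical_rate(100)
print(I, er)  # theory ~0.0201; the empirical rate is larger but on the same scale
```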
5.3 Proof of Cramér's Theorem
The proof is divided into two parts: the upper bound (using Chernoff bounds) and the lower bound (using a change of measure).
5.3.1 Proof of the Upper Bound
Proof: \(\limsup_{n \rightarrow \infty} \frac{1}{n} \log P(S_n \ge na) \le -I(a)\) (Click to expand)
For any \(\theta > 0\), applying Markov's inequality (Chernoff Bound):
\[P(S_n \ge na) = P\big(e^{\theta S_n} \ge e^{\theta na}\big) \le e^{-\theta na} E\big[e^{\theta S_n}\big].\]
Since \(\{X_n\}\) are i.i.d., \(E[e^{\theta S_n}] = (E[e^{\theta X_1}])^n = [\phi(\theta)]^n\). Substituting this yields:
\[P(S_n \ge na) \le e^{-\theta na} [\phi(\theta)]^n = e^{-n(\theta a - \Lambda(\theta))}.\]
Since the above inequality holds for all \(\theta > 0\), we take the supremum:
\[P(S_n \ge na) \le \exp\Big(-n \sup_{\theta > 0} \big\{\theta a - \Lambda(\theta)\big\}\Big).\]
Taking the logarithm on both sides and dividing by \(n\):
\[\frac{1}{n} \log P(S_n \ge na) \le -\sup_{\theta > 0} \big\{\theta a - \Lambda(\theta)\big\}.\]
When \(a > E[X_1]\), the supremum over \(\theta > 0\) equals the global supremum \(\sup_{\theta \in \mathbb{R}}\): by Jensen's inequality \(\Lambda(\theta) \ge \theta E[X_1]\), so for \(\theta \le 0\) we have \(\theta a - \Lambda(\theta) \le \theta(a - E[X_1]) \le 0\), while the supremum over \(\theta > 0\) is at least \(0\). Therefore:
\[\limsup_{n \rightarrow \infty} \frac{1}{n} \log P(S_n \ge na) \le -\sup_{\theta \in \mathbb{R}} \big\{\theta a - \Lambda(\theta)\big\} = -I(a).\]
\(\square\)
5.3.2 Proof of the Lower Bound
This is the most brilliant part of the notes, utilizing a Change of Measure. Its core idea is: construct a new probability measure such that the original "rare event" becomes a "high-frequency event (where the Law of Large Numbers holds)" under the new measure.
Proof: \(\liminf_{n \rightarrow \infty} \frac{1}{n} \log P(S_n \ge na) \ge -I(a)\) (Click to expand)
Step 1: Construct the new measure (Cramér Transform). Let \(F\) be the original distribution function of \(X\). For a chosen parameter \(\lambda\), define the new distribution function \(F_\lambda\):
\[dF_\lambda(x) = \frac{e^{\lambda x}}{\phi(\lambda)} \, dF(x).\]
Under the new measure \(P_\lambda\), the expectation of the random variable \(X\) is:
\[E_\lambda[X] = \frac{1}{\phi(\lambda)} \int x e^{\lambda x} \, dF(x) = \frac{\phi'(\lambda)}{\phi(\lambda)} = \Lambda'(\lambda).\]
Step 2: Select the optimal parameter \(\lambda_a\). We select a specific \(\lambda_a\) such that under the new measure, \(E_{\lambda_a}[X] = \Lambda'(\lambda_a) = a\). At this point, according to convex analysis theory, we have exactly \(I(a) = \lambda_a a - \Lambda(\lambda_a)\).
Step 3: Estimate the original probability using the LLN under the new measure. Let \(P_{\lambda_a}^n\) be the joint distribution of \(n\) random variables that are independent and identically distributed according to \(F_{\lambda_a}\). Then we have the Radon-Nikodym derivative:
\[\frac{dP^n}{dP_{\lambda_a}^n}(x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{\phi(\lambda_a)}{e^{\lambda_a x_i}} = e^{-\lambda_a S_n} [\phi(\lambda_a)]^n, \qquad S_n = \sum_{i=1}^{n} x_i.\]
Consider the probability over a small neighborhood \((a, a+\epsilon)\):
\[P(S_n \ge na) \ge P\Big(\frac{S_n}{n} \in (a, a+\epsilon)\Big) = \int_{\{S_n/n \in (a, a+\epsilon)\}} e^{-\lambda_a S_n} [\phi(\lambda_a)]^n \, dP_{\lambda_a}^n.\]
Since within this integration region \(S_n \le n(a+\epsilon)\), and \(e^{-\lambda_a S_n} \ge e^{-\lambda_a n(a+\epsilon)}\), we get:
\[P(S_n \ge na) \ge e^{-\lambda_a n(a+\epsilon)} [\phi(\lambda_a)]^n \, P_{\lambda_a}\Big(\frac{S_n}{n} \in (a, a+\epsilon)\Big).\]
Note that under the new measure \(P_{\lambda_a}\), \(E_{\lambda_a}[S_n/n] = a\). The weak law of large numbers concentrates \(S_n/n\) around \(a\); for the one-sided window \((a, a+\epsilon)\), the central limit theorem under \(P_{\lambda_a}\) gives:
\[P_{\lambda_a}\Big(\frac{S_n}{n} \in (a, a+\epsilon)\Big) \rightarrow \frac{1}{2}, \qquad \text{so} \qquad \frac{1}{n} \log P_{\lambda_a}\Big(\frac{S_n}{n} \in (a, a+\epsilon)\Big) \rightarrow 0.\]
Taking the logarithm on both sides and dividing by \(n\):
\[\liminf_{n \rightarrow \infty} \frac{1}{n} \log P(S_n \ge na) \ge -\lambda_a (a+\epsilon) + \Lambda(\lambda_a) = -I(a) - \lambda_a \epsilon.\]
Letting \(\epsilon \rightarrow 0\), we obtain the lower bound \(-I(a)\). \(\square\)
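The change of measure is not only a proof device; it is also how rare probabilities are computed in practice (importance sampling). A sketch under assumed \(N(0,1)\) steps: the tilted law with \(\lambda_a = a\) is simply \(N(a, 1)\), the event \(\{S_n \ge na\}\) becomes typical there, and the Radon-Nikodym weight \(e^{-\lambda_a S_n}[\phi(\lambda_a)]^n = e^{-a S_n + n a^2/2}\) converts each sample back to the original measure.

```python
import math
import random

random.seed(7)
n, a = 25, 2.0  # target: P(S_25 >= 50) for i.i.d. N(0,1) -- about 7.6e-24

def tilted_estimate(trials=20_000):
    """Importance-sampling estimate of P(S_n >= n*a) via exponential tilting."""
    total = 0.0
    for _ in range(trials):
        s = sum(random.gauss(a, 1.0) for _ in range(n))  # draw under P_lambda_a
        if s >= n * a:
            total += math.exp(-a * s + n * a * a / 2.0)  # R-N weight back to P
    return total / trials

est = tilted_estimate()
ldp_scale = math.exp(-n * a * a / 2.0)  # e^{-n I(a)} with I(a) = a^2/2
print(est, ldp_scale)
```

Naive Monte Carlo would need on the order of \(10^{24}\) samples to see this event even once; a few thousand tilted samples suffice, and the estimate sits a couple of orders below \(e^{-nI(a)}\), which is exactly the sub-exponential prefactor the LDP ignores.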
6. Intuitive Understanding and Applications of LDP
The Large Deviation Principle plays an extremely important role in statistical inference.
6.1 Rate Functions \(I(a)\) of Common Distributions
| Distribution Type | Parameters | Rate Function \(I(a)\) | Physical/Statistical Significance |
|---|---|---|---|
| Normal Distribution | \(N(\mu, \sigma^2)\) | \(\frac{(a-\mu)^2}{2\sigma^2}\) | Probability decays exponentially with the squared Euclidean distance |
| Bernoulli Distribution | \(p\) | \(a \log \frac{a}{p} + (1-a) \log \frac{1-a}{1-p}\) | i.e., the KL Divergence between \(a\) and \(p\) |
| Poisson Distribution | \(\lambda\) | \(a \log \frac{a}{\lambda} - a + \lambda\) | Deviation penalty for rare event counts |
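The Bernoulli row of the table can be cross-checked numerically (an added sketch): computing the Legendre transform of the Bernoulli log-MGF \(\Lambda(\theta) = \log(1 - p + p e^{\theta})\) on a grid reproduces the KL-divergence formula.

```python
import math

def legendre_sup(a, p, steps=20_000):
    """Grid sup of theta*a - log(1 - p + p*e^theta) over theta in [-10, 10]."""
    best = float("-inf")
    for i in range(steps + 1):
        t = -10.0 + 20.0 * i / steps
        best = max(best, t * a - math.log(1 - p + p * math.exp(t)))
    return best

def kl(a, p):
    """Rate function from the table: KL divergence between Bernoulli(a) and Bernoulli(p)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

for a in (0.3, 0.6, 0.9):
    print(a, legendre_sup(a, 0.5), kl(a, 0.5))  # the two columns agree closely
```

The same grid routine with \(\Lambda(\theta) = \lambda(e^{\theta} - 1)\) would recover the Poisson entry of the table as well.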
6.2 Implications for Statistical Inference
- Maximum Likelihood Estimation (MLE): In an asymptotic sense, the consistency of MLE can be characterized by LDP to describe the convergence rate of its error probability.
- Hypothesis Testing (Sanov's Theorem): The probability decay rates of Type I and Type II errors are usually determined by LDP.
- "Small Probability" Decisions in Large Samples: In the context of big data (where \(n\) is large), although the sample mean converges to the expectation, if we must face extreme values of \(a\), LDP provides a specific quantitative metric for the risk.
"The essence of LDP is the change of measure: it turns a miracle into a mundane reality."