Large Deviation Theory



Part of a series: Stochastic fundamentals.


Consider (again) a sequence of iid random variables \(X_1,X_2,X_3,\dots\) with expected value \(\mu = E(X_1)\). The arithmetic mean \(S_n = \frac 1 n \sum_{j=1}^n X_j\) tends to \(\mu\) in the limit \(n \rightarrow \infty\) in some sense. But what does this mean precisely and how fast is this convergence?

  • The weak law of large numbers gives a precise meaning to the convergence \(P\big(|S_n - \mu| > \epsilon \big) \overset{n \rightarrow \infty}{\longrightarrow} 0\), but it does not tell us anything about the speed of convergence.

  • The Chebyshev inequality yields \(P\big(|S_n - \mu| > \epsilon \big) \leq \frac{V(X_1)}{n \epsilon^2}\,.\) This suggests convergence at order \(\frac 1 n\). But this is a worst-case result, and in many cases convergence is much faster.

  • The central limit theorem yields \(\sqrt{n} (S_n - \mu) \overset{d}{\longrightarrow} \mathcal{N} (0,1)\) (assuming unit variance) such that \(P\big(|S_n - \mu| > \epsilon \big) \approx 2 \big(1 - \Phi (\epsilon \sqrt{n})\big)\) for large \(n\). But this is an approximation, not a strict upper bound. Furthermore, the convergence is much worse in the tails than in the center of the distribution.

Large deviation theory now gives bounds on the tail \(P\big(|S_n - \mu| > \epsilon \big) \leq e^{- c n}\) for many cases of practical interest. Notably, these bounds are rather tight, so the speed of convergence is indeed exponential with rate \(c\). To establish these results we first need a few more technical tools.
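A quick Monte Carlo sketch (not part of the original notes) makes the contrast concrete: for fair coin flips, the empirical tail probability \(P(|S_n - \tfrac12| > 0.2)\) falls far faster than the \(\frac 1 n\) decay suggested by the Chebyshev bound. The parameters below are illustrative choices, not taken from the text.

```python
import random

random.seed(1)

def tail_prob(n, eps=0.2, trials=20_000):
    """Monte Carlo estimate of P(|S_n - 1/2| > eps) for n fair coin flips."""
    hits = 0
    for _ in range(trials):
        s = sum(random.randint(0, 1) for _ in range(n)) / n
        if abs(s - 0.5) > eps:
            hits += 1
    return hits / trials

for n in (10, 40):
    chebyshev = 0.25 / (n * 0.2**2)  # V(X_1) / (n eps^2) with V(X_1) = 1/4
    print(n, tail_prob(n), chebyshev)
```

Quadrupling \(n\) shrinks the Chebyshev bound by a factor of 4, but the observed tail probability shrinks by more than a factor of 10, hinting at the exponential decay derived below.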


Definition Let \(X\) be a real random variable. The moment generating function is defined as \(M_X(\theta) = E\left(e^{\theta X}\right)\) wherever the expected value exists.

Properties and remarks

  • The moment generating function is not guaranteed to exist for \(\theta \neq 0\) – in contrast to the characteristic function \(\phi_X(t)\).

  • \(M_X(0) = 1\).

  • The domain of \(M_X(\theta)\) is an interval containing zero: \(\mathcal{D}(M_X) = [\theta_1,\theta_2] \text{ with } \theta_1 \leq 0 \leq \theta_2\). The degenerate case \(\theta_1 = \theta_2 = 0\) is possible.

  • Suppose \((\theta_1,\theta_2) \subset \mathcal{D}(M_X)\) and \(\theta_1 < \theta_0 < \theta_2\).
    Then \(\frac{d}{d\theta} M_X(\theta) \bigg|_{\theta= \theta_0} = E \left( X e^{\theta_0 X}\right) < \infty\), i.e. differentiation and expected value can be exchanged at \(\theta_0\).

  • \(M_X(\theta)\) generates the moments in the sense \(E(X^n) = \frac{d^n M_X(\theta)}{d\theta^n}\bigg|_{\theta = 0}\), provided that these moments exist.
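The last property can be checked numerically. The sketch below (an illustration, not from the notes) differentiates the known MGF of a standard normal, \(M(\theta) = e^{\theta^2/2}\), by central finite differences at \(\theta = 0\) and recovers \(E(X) = 0\) and \(E(X^2) = 1\).

```python
import math

def M(theta):
    """MGF of a standard normal: M(theta) = exp(theta^2 / 2)."""
    return math.exp(theta**2 / 2)

h = 1e-4
# central finite differences approximate the derivatives at theta = 0
first_moment = (M(h) - M(-h)) / (2 * h)            # ~ E(X)   = 0
second_moment = (M(h) - 2 * M(0) + M(-h)) / h**2   # ~ E(X^2) = 1
print(first_moment, second_moment)
```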

Theorem (Chernoff bound) Let \(X_1,X_2,\dots\) be a sequence of iid random variables with finite expected value \(\mu = E(X_1)\) and with \(M_{X_1} (\theta)\) being finite in some interval \((\theta_1,\theta_2) \ni 0\). Denote the arithmetic mean of the \(X_j\) as \(S_n = \frac 1 n \sum_{j=1}^n X_j\). Then for all \(a > \mu\) we find a \(\theta \in (0,\theta_2)\) such that

  1. \(P(S_n > a) \leq \left[ \frac{M_{X_1} (\theta)}{e^{\theta a}} \right]^n\)

  2. \(\frac{M_{X_1} (\theta)}{e^{\theta a}} < 1\)


Proof

  1. \(\begin{aligned} P(S_n > a) &= P\Big(\sum_j X_j > n a \Big)\\ &= P \left(e^{\theta \sum_j X_j} > e^{\theta n a} \right) ~~~~ \text{for } \theta > 0\\ &\overset{\text{Markov}}{\leq} \frac{E\left(e^{\theta \sum_j X_j}\right)}{e^{\theta n a}}\\ &= \left[ \frac{M_{X_1} (\theta)}{e^{\theta a}} \right]^n \end{aligned}\)


    Example

    a) Normal distribution:
    \(\begin{aligned} f_X(x) &= \frac{1}{\sigma \sqrt{2\pi}} e^{- \frac{(x-\mu)^2}{2 \sigma^2}}\\ \phi_X(t) &= e^{i t \mu - \frac 1 2 \sigma^2 t^2}\\ M_X(\theta) &= e^{\theta \mu + \frac 1 2 \sigma^2 \theta^2} \end{aligned}\)
    b) Cauchy distribution:
    \(\begin{aligned} f_X(x) &= \frac{\gamma}{\pi \left[ \gamma^2 + (x-\mu)^2 \right]}\\ \phi_X(t) &= e^{i t \mu - \gamma |t|}\\ M_X(\theta) &\text{ does not exist for } \theta \neq 0\,. \end{aligned}\)

    Lemma (Markov inequality)

    Let \(X\) be a non-negative random variable and \(a>0\).
    Then \(P(X \geq a) \leq \frac{E(X)}{a}\,.\)

    Proof Let \(\mathbb{1}_{X\geq a}\) be the indicator function for the event \(X \geq a\).
    Then \(\begin{aligned} a \mathbb{1}_{X\geq a} &\leq X\\ E(a \mathbb{1}_{X\geq a}) &\leq E(X)\\ a P(X\geq a) &\leq E(X) \end{aligned}\) \(\Box\)

  2. We have \(\begin{aligned} \frac{M_{X_1} (\theta)}{e^{\theta a}} \bigg|_{\theta=0} &= \frac 1 1 = 1\\ \frac{d}{d \theta}\frac{M_{X_1} (\theta)}{e^{\theta a}} \bigg|_{\theta=0} &= \frac{E\left(X_1 e^{\theta X_1}\right) e^{\theta a} - a e^{\theta a} E\left(e^{\theta X_1}\right)}{e^{2\theta a}}\bigg|_{\theta = 0}\\ &= \frac{E(X_1) - a E(1)}{1} = \mu - a\\ &< 0 ~~~~ \text{for } a > \mu \,. \end{aligned}\)

    For a sufficiently small \(\theta>0\) we must therefore have \(\frac{M_{X_1}(\theta)}{e^{\theta a}} < 1\). \(\Box\)
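The second step of the proof can be illustrated with the standard normal, whose MGF \(M(\theta) = e^{\theta^2/2}\) is known in closed form (a sketch, not part of the notes): the ratio equals 1 at \(\theta = 0\) and dips below 1 for small \(\theta > 0\) whenever \(a > \mu = 0\).

```python
import math

def ratio(theta, a):
    """M(theta)/e^(theta*a) for X ~ N(0,1), where M(theta) = exp(theta^2/2)."""
    return math.exp(theta**2 / 2 - theta * a)

a = 1.0                  # any a > mu = 0
print(ratio(0.0, a))     # equals 1 at theta = 0
print(ratio(0.5, a))     # below 1 for small theta > 0
print(ratio(1.0, a))     # even smaller: the minimizer is theta = a
```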


Similarly, we obtain a bound for \(a < \mu\): \(P(S_n < a) \leq \bigg[ \underbrace{\frac{M_{X_1} (\theta)}{e^{\theta a}}}_{<1} \bigg]^n\) for some \(\theta <0\).

We now want this bound to be as tight as possible by choosing an appropriate \(\theta\). To this end we rewrite: \(P(S_n > a) \leq e^{-n \underbrace{\left[ \theta a - \ln M_{X_1}(\theta) \right]}_{\text{should be as large as possible!}}}\)

Definition The Legendre transform of a random variable \(X\) is the function \(I_X (a) := \sup_{\theta \in \mathbb{R}} \left[ \theta a - \ln M_X(\theta)\right]\,.\)

Corollary Using the notation as above, we have \(P(S_n > a) \leq e^{- n I_{X_1} (a)}\,.\)

Note This bound becomes tight for \(n \rightarrow \infty\), as shown by Cramér's theorem.
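When the supremum cannot be taken analytically, the Legendre transform can be approximated on a \(\theta\)-grid. The sketch below (an illustration, not from the notes; `rate_function` is a name chosen here) recovers \(I(a) = a^2/2\) for the standard normal, where \(\ln M(\theta) = \theta^2/2\).

```python
def rate_function(a, log_mgf, thetas):
    """Numerical Legendre transform: max over a grid of theta values."""
    return max(theta * a - log_mgf(theta) for theta in thetas)

# grid of theta values in [-5, 5] with step 0.001
thetas = [t / 1000 for t in range(-5000, 5001)]

# standard normal: ln M(theta) = theta^2 / 2, so I(a) = a^2 / 2
I = rate_function(1.5, lambda th: th**2 / 2, thetas)
print(I)  # 1.125 = 1.5**2 / 2
```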


  1. Normal distribution \(X_1 \sim \mathcal{N} (0,1)\)

    \[\begin{split}\begin{aligned} &M_{X_1} (\theta) = e^{\theta^2/2}\\ \Rightarrow~~ &I_{X_1} (a) = \sup_{\theta} \left(\theta a - \frac{\theta^2}{2}\right) = \frac{a^2}{2}\\ \Rightarrow~~ &P(S_n > a) \leq e^{-n \frac{a^2}{2}} \end{aligned}\end{split}\]
  2. Poisson distribution \(X_1 \sim Poi(\lambda)\)

    \[\begin{split}\begin{aligned} &M_{X_1} (\theta) = E\left(e^{\theta X_1}\right) = \sum_{m=0}^\infty e^{\theta m} \frac{\lambda^m}{m!} e^{-\lambda} = \sum_{m=0}^\infty \frac{\left( e^{\theta} \lambda\right)^m}{m!} e^{-\lambda} = e^{e^\theta \lambda - \lambda}\\ \Rightarrow~~ &I_{X_1} (a) = \sup_{\theta} \left(\theta a - e^\theta \lambda + \lambda\right) = a \ln \left(\frac{a}{\lambda}\right) - (a - \lambda) ~~~~~ ~~~~~ \text{attained at } \theta = \ln \left(\frac{a}{\lambda}\right) \\ \Rightarrow~~ &P(S_n > a) \leq e^{-n \left[ a \ln (a/\lambda) - a + \lambda \right]} \end{aligned}\end{split}\]
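As a sanity check on the Poisson rate function (a sketch, not part of the notes; \(\lambda = 2\), \(a = 3\) are arbitrary illustrative values), a grid search over \(\theta\) of \(\theta a - \ln M_{X_1}(\theta)\) with \(\ln M_{X_1}(\theta) = \lambda(e^\theta - 1)\) reproduces the closed form \(a \ln(a/\lambda) - (a - \lambda)\).

```python
import math

lam, a = 2.0, 3.0
log_mgf = lambda th: lam * (math.exp(th) - 1.0)  # ln M(theta) = lambda*(e^theta - 1)

# grid search for the supremum over theta in [-5, 5]
grid = [t / 10_000 for t in range(-50_000, 50_001)]
numeric = max(th * a - log_mgf(th) for th in grid)

# closed form derived above: a*ln(a/lambda) - (a - lambda)
analytic = a * math.log(a / lam) - (a - lam)
print(numeric, analytic)
```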


Philipp Böttcher, Dirk Witthaut