Some Gaussians hidden in uniform measures…

Consider a vector {\theta\sim\textsf{N}(0, {\textrm{I}}/N)} where {\theta \in {\mathbb R}^N}. It is often claimed that this distribution is “very close”, in the limit of large {N}, to the uniform measure on the sphere {S_N \subset {\mathbb R}^N} of unit radius. This is, in fact, not hard to see (informally) using a Chernoff bound argument. Let {\theta_i} denote the {i^{\mathrm{th}}} entry of {\theta}; the Chernoff argument yields the following inequalities:

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \le 1-\epsilon\right)&\le& \exp\left\{ \frac{N}{2}\left(\epsilon + \log(1-\epsilon)\right) \right\} \\ 	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \ge 1+\epsilon \right)&\le& \exp\left\{ \frac{N}{2}\left(-\epsilon + \log(1+\epsilon)\right) \right\}. \end{array}

The way to obtain this is pretty standard. For instance the first inequality can be obtained as:

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \le 1-\epsilon\right)&=&\mathop{\mathbb P}\left[\exp(-\lambda \sum_{i=1}^N\theta_i^2) \ge \exp(-\lambda(1-\epsilon)) \right]\\ 	&\le& \mathop{\mathbb E}\left[ \exp(-\lambda\sum_{i=1}^N \theta_i^2) \right]\exp(\lambda(1-\epsilon)), \end{array}

where {\lambda\ge0} and the inequality is by Markov. Optimizing over {\lambda} yields the result, as sketched below. This tells us that {\lVert\theta\rVert^2} concentrates around its expected value of 1 to within {O(N^{-1/2 + \delta})} for any small constant {\delta > 0}. In fact one can do better than this using an exact large deviations calculation, which I will probably leave to another post.
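To spell out the optimization: each {\theta_i^2} has moment generating function {\mathop{\mathbb E}[\exp(-\lambda\theta_i^2)] = (1+2\lambda/N)^{-1/2}} for {\lambda \ge 0}, so the bound above reads

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \le 1-\epsilon\right)&\le& \exp\left\{ \lambda(1-\epsilon) - \frac{N}{2}\log\left( 1+\frac{2\lambda}{N} \right) \right\}, \end{array}

which is minimized at {\lambda^* = \frac{N}{2}\cdot\frac{\epsilon}{1-\epsilon}}; substituting back gives the exponent {\frac{N}{2}(\epsilon + \log(1-\epsilon))} stated above. The upper tail is handled the same way, applying Markov to {\exp(\lambda\sum_i\theta_i^2)} with {0\le\lambda<N/2}.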

Thus for large dimensions {N}, the probability mass of the distribution {\textsf{N}(0, {\textrm{I}}/N)} is concentrated in a shell of thickness {O(N^{-1/2})} and mean radius 1, which, intuitively, looks very much like {S_N}. The question that interested me is how we can prove the converse: assuming {\theta\sim U(S_N)}, where {U(S_N)} denotes the uniform (spherical) measure on {S_N}, do the marginals of each entry, say {\theta_1}, look Gaussian? More precisely, let {\mu_N} denote the law of {\sqrt N \theta_1} on {({\mathbb R}, \mathcal{B})}. Then, does {\mu_N} converge weakly to {\textsf{N}(0, 1)}? This is, in fact, true and is an old result due to Borel in his treatise “Introduction géométrique à quelques théories physiques”. I will sketch a simple argument via characteristic functions.
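As a quick sanity check before the proof, here is a minimal simulation sketch (Python, with NumPy/SciPy assumed available; the function name is mine). It samples {\theta \sim U(S_N)} by normalizing Gaussian vectors, which is valid by rotation invariance of the Gaussian, and measures the Kolmogorov–Smirnov distance between the empirical law of {\sqrt N \theta_1} and {\textsf{N}(0,1)}:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def sphere_marginal(N, samples=20000):
        # Uniform points on S_N via normalized Gaussians (rotation invariance),
        # then return the rescaled first coordinate sqrt(N) * theta_1.
        g = rng.standard_normal((samples, N))
        theta = g / np.linalg.norm(g, axis=1, keepdims=True)
        return np.sqrt(N) * theta[:, 0]

    for N in (5, 50, 500):
        ks = stats.kstest(sphere_marginal(N), "norm").statistic
        print(f"N = {N:4d}: KS distance to N(0,1) = {ks:.4f}")

The KS distance should shrink as {N} grows, consistent with weak convergence.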

The function {\phi(\lambda) \equiv \mathop{\mathbb E}(e^{\iota\lambda\sqrt N \theta_1})} is the characteristic function of {\sqrt N \theta_1 \sim \mu_N}, where {\iota = \sqrt{-1}}. We can compute this expectation explicitly. The formula for the “area” of an {N}-dimensional spherical cap, as given (here), is:

\displaystyle  \begin{array}{rcl}  	A &=& \frac{1}{2}A_N I_{2h-h^2}\left( \frac{N-1}{2}, \frac{1}{2} \right), \end{array}

where {A_N} is the surface area of {S_N}, {h} is the height (“depth”) of the cap and {I_{2h-h^2}(\cdot, \cdot)} is the regularized incomplete beta function. Differentiating this with respect to {\theta_1} (using {\frac{\partial}{\partial x}I_x(a, b) = x^{a-1}(1-x)^{b-1}/B(a,b)}), dividing by {A_N} and noting that {h \equiv 1 - \theta_1}, so that {2h - h^2 = 1 - \theta_1^2}, gives the density of {\theta_1}; the characteristic function is then the following integral:

\displaystyle  \begin{array}{rcl}  	\phi(\lambda) &=& B\left( \frac{N-1}{2}, \frac{1}{2} \right)^{-1} 	\int_{-1}^1 e^{\iota\lambda\sqrt N \theta_1}(1-\theta_1^2)^{(N-3)/2}\,\mathrm{d}\theta_1, \end{array}

where {B(\cdot, \cdot)} is the beta function. To do the above, we have to take care of a few sign conventions, which I have omitted. Now, {B(x, y) \approx \Gamma(y)x^{-y}} for fixed {y} and large {x} by Stirling’s approximation. With this and a change of variables {y \equiv \sqrt N \theta_1}, we obtain:

\displaystyle  \begin{array}{rcl}  	\phi(\lambda) &\approx& \frac{1}{\Gamma(\frac{1}{2}) \left( \frac{N-1}{2} \right)^{-1/2}\sqrt{N}} 	\int_{-\sqrt N}^{\sqrt N} e^{\iota\lambda y}\left( 1-\frac{y^2}{N} \right)^{(N-3)/2}\mathrm{d}y \\ 	&\rightarrow& \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{\iota \lambda y} e^{-y^2/2}\mathrm{d}y, \end{array}

by dominated convergence (for {N \ge 6} we have {(1-y^2/N)^{(N-3)/2} \le e^{-y^2/4}} on {[-\sqrt N, \sqrt N]}, which dominates the integrand). The last limit is precisely what we require: it equals {e^{-\lambda^2/2}}, the characteristic function {\phi_G(\lambda) = \mathop{\mathbb E}(e^{\iota\lambda G})} of {G\sim\textsf{N}(0,1)}. The rest follows from Lévy’s continuity theorem.
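One can also check the computation numerically: the sketch below (again assuming SciPy; the helper {\texttt{phi}} is mine) integrates against the exact density {(1-\theta_1^2)^{(N-3)/2}/B(\frac{N-1}{2},\frac{1}{2})}. Only the real part is computed, since the density is even and the imaginary part vanishes.

    import numpy as np
    from scipy import integrate, special

    def phi(lam, N):
        # Characteristic function of sqrt(N)*theta_1 for theta ~ U(S_N),
        # by quadrature against the exact marginal density on [-1, 1].
        c = special.beta((N - 1) / 2, 0.5)
        f = lambda t: np.cos(lam * np.sqrt(N) * t) * (1 - t**2) ** ((N - 3) / 2) / c
        val, _ = integrate.quad(f, -1, 1)
        return val

    lam = 1.5
    print(f"target exp(-lam^2/2) = {np.exp(-lam**2 / 2):.6f}")
    for N in (5, 50, 500):
        print(f"N = {N:4d}: phi(lam) = {phi(lam, N):.6f}")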

There are significant improvements to this result. Diaconis and Freedman prove that this holds for all low-dimensional marginals, even in the sense of variation distance. More precisely, let {\theta_k \in {\mathbb R}^k} denote the restriction of {\theta} to the first {k} entries. They give a sharp bound on the variation distance between the law of {\sqrt{N}\theta_k} and {\textsf{N}(0, {\textrm{I}}_k)}. Interesting generalizations include this result of D’Aristotile, Diaconis and Newman, which says that {\mathop{\textsf{Tr}}(AU) \approx \textsf{N}(0, 1)} when {U} is an orthogonal matrix of size {N\times N} chosen uniformly at random and {A} is a fixed matrix with {\mathop{\textsf{Tr}}(AA^\textsf{T}) = N}, and others which state that even a “small” submatrix of {U} behaves as though its entries were iid standard normals.
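The orthogonal-matrix version is also easy to simulate, as a sketch rather than a verification of the precise theorem: scipy.stats.ortho_group draws Haar-distributed orthogonal matrices, and the simple choice {A = {\textrm{I}}_N} satisfies {\mathop{\textsf{Tr}}(AA^\textsf{T}) = N}, so {\mathop{\textsf{Tr}}(AU)} reduces to {\mathop{\textsf{Tr}}(U)}:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    N, samples = 50, 2000

    # With A = I_N, Tr(A A^T) = N and Tr(AU) = Tr(U), which should be ~ N(0,1).
    traces = np.array([np.trace(stats.ortho_group.rvs(N, random_state=rng))
                       for _ in range(samples)])
    print(f"mean = {traces.mean():.3f}, std = {traces.std():.3f}")
    print(f"KS distance to N(0,1) = {stats.kstest(traces, 'norm').statistic:.4f}")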

I am sure I have barely scratched the surface here, but it is clear that many intuitively plausible approximations of this kind turn out to be accurate in a fully rigorous sense.

UPDATE (04/05/13): Thanks to Jiantao Jiao for some corrections above.
