Some Gaussians hidden in uniform measures…

February 5, 2013

Consider a vector {\theta\sim\textsf{N}(0, {\textrm{I}}/N)} where {\theta \in {\mathbb R}^N}. It is often claimed that this distribution is “very close”, in the limit of large {N}, to that of the uniform measure on the sphere {S_N \subset {\mathbb R}^N} of unit radius. This is, in fact, not hard to see (informally) using a Chernoff bound argument. Firstly, let {\theta_i} denote the {i^{\mathrm{th}}} entry of {\theta}. More precisely, the Chernoff argument yields the following inequalities:

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \le 1-\epsilon\right)&\le& \exp\left\{ \frac{N}{2}(\epsilon - \log(1-\epsilon) \right\} \\ 	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \ge 1+\epsilon \right)&\le& \exp\left\{ \frac{N}{2}(-\epsilon + \log(1+\epsilon) \right\}. \end{array}

The way to obtain this is pretty standard. For instance the first inequality can be obtained as:

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb P}\left(\sum_{i=1}^N\theta_i^2 \le 1-\epsilon\right)&=&\mathop{\mathbb P}\left[\exp(-\lambda \sum_{i=1}^N\theta_i^2) \ge \exp(-\lambda(1-\epsilon)) \right]\\ 	&\le& \mathop{\mathbb E}\left[ \exp(-\lambda\sum_{i=1}^N \theta_i^2) \right]\exp(\lambda(1-\epsilon)), \end{array}

where {\lambda\ge0} and the inequality is by Markov. Optimizing over {\lambda} yields the result. This tells us that {\lVert\theta\rVert^2} concentrates around the expected value of 1 at least to the extent of {O(N^{-1/2 + \delta})} for a small constant {\delta}. In fact one can do better than this using an exact large deviations calculation, which I will probably leave to another post.

Thus for large dimensions {N}, the probability mass of the distribution {\textsf{N}(0, {\textrm{I}}/N)} is concentrated in a shell of thickness {O(N^{-1/2})} and mean radius 1, which intuitively, looks very much like {S_N}. The question that interested me is how we can prove the converse: assuming {\theta\sim U(S_N)} where {U(S_N)} denotes the spherical measure on {S_N} do the marginals of each entry, say {\theta_1} look Gaussian? More precisely, let {\mu_N} denote the measure of {\sqrt N \theta_1} on {({\mathbb R}, \mathcal{B})}. Then, does {\mu_N} converge weakly to {\textsf{N}(0, 1)}? This is, in fact, true and is an old result due to Borel in his (book/treatise) “Introduction géometrique à quelques théorie physiques”. I will sketch a simple argument via characteristic functions.

The function {\phi(\lambda) \equiv \mathop{\mathbb E}(e^{\iota\lambda\sqrt N \theta_1})} is the characteristic function of {\sqrt N \theta_1 \sim \mu_N}, where {\iota = \sqrt{-1}}. We can compute this integral explicitly. The formula for the “area” of an {N}-dimensional spherical cap as given (here) is:

\displaystyle  \begin{array}{rcl}  	A &=& \frac{1}{2}A_N I_{2h-h^2}\left( \frac{N-1}{2}, \frac{1}{2} \right), \end{array}

where {A_N} is the surface area of {S_N}, {h} is the depth and {I_{2h-h^2}(\cdot, \cdot)} is the normalized incomplete beta function. Differentiating this, dividing by {A_N} and noting that {h \equiv 1 - \theta_1}, we can compute the characteristic function as the following integral:

\displaystyle  \begin{array}{rcl}  	\phi(\lambda) &=& B\left( \frac{n-1}{2}, \frac{1}{2} \right)^{-1} 	\int_{-1}^1 e^{\iota\lambda\sqrt N \theta_1}(1-\theta_1)^{(N-3)/2}\mathrm{d}\theta_1, \end{array}

where {B(\cdot, \cdot)} is the beta function. To do the above, we have to take care of a few sign conventions, which I have omitted. Now, {B(x, y) \approx \Gamma(y)x^{-y}} for fixed {y} and large {x} by Stirling’s approximation. With this and a change of variables {y \equiv \sqrt N \theta_1}, we obtain:

\displaystyle  \begin{array}{rcl}  	\phi(\lambda) &\approx& \frac{1}{\Gamma(\frac{1}{2}) \left( \frac{N-1}{2} \right)^{-1/2}\sqrt{N}} 	\int_{-\sqrt N}^{\sqrt N} e^{\iota\lambda y}\left( 1-\frac{y^2}{N} \right)^{(N-3)/2}\mathrm{d}y \\ 	&\rightarrow& \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{\iota \lambda y} e^{-y^2/2}\mathrm{d}y, \end{array}

by dominated convergence. The last limit is precisely what we require, namely the characteristic function {\phi_G(\lambda) = \mathop{\mathbb E}(e^{\iota\lambda G})} where {G\sim\textsf{N}(0,1)}. The rest follows from Lévy’s continuity theorem.

There are significant improvements to this result. Diaconis and Freedman prove that this holds for all low-dimensional marginals, even in the sense of variation distance. More precisely, let {\theta_k \in {\mathbb R}^k} denote the restriction of {\theta} to the first {k} entries. They give a sharp bound on the variation distance between the law of {\sqrt{N}\theta_k} and {\textsf{N}(0, {\textrm{I}}_k)}. Interesting generalizations include this result by D’Aristotile, Diaconis and Newman which says that {\mathop{\textsf{Tr}}(AU) \approx \textsf{N}(0, 1)} when {U} is a orthogonal matrix of size {N\times N} chosen uniformly at random, and {\mathop{\textsf{Tr}}(AA^\textsf{T}) = N}, and others which state that even a “small” submatrix of {U} behaves as though its entries were iid standard normals.

I am sure I have barely scratched the surface here, but it is clear that many (and sometimes intuitively plausible) approximations of this kind are actually fairly accurate even rigorously.

UPDATE (04/05/13): Thanks to Jiantao Jiao for some corrections above.

Coupon collecting

January 23, 2013

Coupon collecting is a classical homework problem in undergraduate probability. Suppose there exist {n} different types of coupons. At each time one selects a coupon of a uniformly random type. In expectation, how many coupons does one need to collect before one has at least one coupon of every type? The analogous problem of throwing balls in bins is similar: suppose there were {n} bins and one threw balls uniformly randomly into the bins. How many balls would one throw, in expectation, before all the bins were filled?

By the pigeonhole principle one would have to collect at least {n} coupons, but this is not enough. With probability {1 - n!/n^n > 0} one would get at least one overlap and thus, at least one bin would be empty. If {\tau} is the (random) number of coupons required to cover all the types, then the task is to find {\mathop{\mathbb E}\tau}.

It is useful to approach this in an inductive fashion: assume that {k} types have already been collected. Since the number of selections required to get a new type is a geometric random variable of success probability {p_k = 1-k/n}. Thus, in expectation, one needs {n/(n-k)} tries to get a new type after collecting {k} types. This yields that, starting from nothing, one needs {\sum_{i=1}^{n-1} n/(n-i) = nH_{n-1}} tries in expectation, where {H_{n}} denotes the harmonic sum upto {n}. This number is roughly {\log n} yielding that {\mathop{\mathbb E}\tau \approx n\log n}.

More interesting forms of “coupon-collecting” type questions are, say, the number of tries are needed to obtain a coupon of each type with high probability (and not just in expectation). Or, assuming one collected {n} coupons u.a.r as before, what would be the rough size of the coupon type with maximum collection. It is interesting how relatively simple problems arise in unexpected places. I started this post after noting that the same “coupon collector” effect yields a lower bound of {O(nr\log(n))} randomly sampled entries to recover a rank {r} matrix of size {n} (as proved Tao and Candes’ paper on low rank matrix recovery).

 

“Solving” Sudoku using Belief Propagation

September 9, 2012

I recently read this post by Tejal Bhamre on Sudoku as a graph coloring problem. Graph coloring, and thus Sudoku, can be viewed as an inference problem in graphical models. Consider a probability distribution over the different assignments in a Sudoku puzzle. The key graphical models idea is to use the graph as a structure which simplifies this probability distribution. I’ll use Sudoku as a running example:

The viewpoint in Tejal’s post was to consider each square in the Sudoku as a vertex of a graph G with edges connecting a vertex (or square) with all vertices  in it’s row, column and 3 \times 3 square. The vertices can take on values (or colors) 1, 2, \cdots 9. The proper coloring constraint then ensures that the completed puzzle is a valid Sudoku solution. In particular, given values on the various vertices one would like to complete the puzzle by assigning values to all vertices.

One can look at this in a slightly different, but essentially related, way. I’ll associate with the Sudoku puzzle, a bipartite graph G(V, F, E). The vertices in V correspond to the squares (of which there are 3^4 = 81). The vertices in F correspond to constraints: there is a vertex in F for each row, column and 3 \times 3 square. A vertex v \in V and a constraint a \in F are connected by an edge if a constrains v. Finally, each vertex in V takes on a (discrete) value x_v \in \mathcal{X}. The interesting part is the following:

  1. One considers a probability distribution over the configurations of values on the vertices.
  2. This distribution is a product of factors each involving neighbors of a single constraint variable.

Formally, let x denote the vector of values assigned to each vertex in V. We then consider a probability distribution:

\mu (x) = \frac{1}{Z} \prod_{a \in F} \psi(x_{\partial a})

Here Z is an appropriate normalizing constant and x_{\partial a} denotes the restriction of x to the neighborhood \partial a of the constraint a.  We let X denote the random vector drawn from such a distribution. For the Sudoku puzzle, we would like \mu (x) to be positive only if all constraints are satisfied. We therefore may choose:

\psi(x_{\partial a}) = \mathbb{I}(\text{entries in } x_{\partial a} \text{ are different})

The distribution then takes the value \frac{1}{Z} at each valid Sudoku solution. The normalizing constant Z is also called the ‘partition function’ and, here, counts the number of valid solutions.

How does this elaborate framework help us to solve the Sudoku puzzle? One way is to use the above distribution and sample from it using the law of total probability. Let A denote the set of vertices where the values are known. Wlog, let V = \{1, 2, \cdots n\} := [n] and V\backslash A = [k]. Then we might obtain a solution by sampling from \mu(x_{V\backslash A} \vert x_A) which can be done sequentially as:

\mu(x_{V\backslash A} \vert x_A) = \mu(x_1\vert x_A) \mu(x_2 \vert x_1, x_A) \cdots \mu(x_k \vert x_1, x_2, \cdots x_{k-1}, x_A)

The problem in this scheme is that we do not know the distribution \mu completely, since we do not know how to compute the partition function efficiently (of course, one way would be to enumerate all the solutions, but that is what we are trying to avoid!).  Thus, we cannot compute the marginal probabilities as above. This is where the graph structure can help, since although we cannot compute \mu,  we can compute certain conditional distributions fairly easily. For instance, suppose one knew all but one square in a row, then it is easy to predict the value of that square. In general, inference tasks can be classified as one of the following:

  1. Sampling: to give a sample X \sim \mu
  2. Compute marginals: for a subset of vertices A, compute \mu(x_A) = \text{Prob}(X_A = x_A)
  3. Compute conditional probabilities: for subsets A and B, compute \mu(x_A \vert x_B) = \text{Prob}(X_A = x_A \vert X_B = x_B)
  4. Compute the partition function Z

Reductions between these tasks have been studied in MCMC literature (see this paper or this one).

If the underlying graph G(V, F, E) is a tree, then these inference tasks become easy and can be solved in time that is linear in the number of variables (assuming bounded degree constraints and bounded \vert \mathcal{X} \vert) using an algorithm known as belief propagation.

Let G be a tree rooted at the first vertex and suppose we want to find the marginal at the root, i.e. \mu (x_1). Throughout below, I will ignore the normalization involved and use the \cong to indicate this.

\mu(x_1) \cong \sum_{x_{V \backslash 1}}\prod_{a \in F} \psi(x_{\partial a})

Let the children of 1 be denoted by a_1, a_2 \cdots a_\Delta and the subtrees rooted at them by G_{a_1 \rightarrow 1} etc. Then using the distributive property x\cdot y + x\cdot z = x \cdot(y+z)  :

\mu(x_1) \cong \left[ \sum_{x_{G_{a_1 \rightarrow 1}}} \prod_{b \in F_{a_1 \rightarrow 1}} \psi(x_{\partial b}) \right]\left[ \sum_{x_{G_{a_2 \rightarrow 1}}} \prod_{b \in F_{a_2 \rightarrow 1}} \psi(x_{\partial b}) \right]\cdots \\  . \qquad\cong \nu_{a_1 \rightarrow 1}(x_1) \nu_{a_2\rightarrow 1}(x_1) \cdots \nu_{a_\Delta \rightarrow 1}(x_1)

Each term in the braces is a “marginal” of the root with respect to the subgraph rooted at a child and is denoted as \nu_{a \rightarrow 1} in the second line. Consider any single child a_1 and suppose its children are, wlog, \{2, 3, \cdots \tau\}. We recursively rewrite the marginal as:

\nu_{a_1 \rightarrow 1}(x_1) \cong \sum_{x_{G_{a_1 \rightarrow 1}}} \prod_{b \in F_{a_1 \rightarrow 1}} \psi(x_{\partial b})\\  . \quad \cong \sum_{x_{\partial a_1 \backslash 1}}\psi(x_{\partial a_1}) \left[ \sum_{x_{G_{2 \rightarrow a_1}}} \prod_{b \in F_{2 \rightarrow a_1}}\psi(x_{\partial b})\right]\left[ \sum_{x_{G_{3 \rightarrow a_1}}} \prod_{b \in F_{3 \rightarrow a_1}}\psi(x_{\partial b})\right] \cdots \\  .\quad \cong \sum_{x_{\partial a_1 \backslash 1}}\psi(x_{\partial a_1})\nu_{2 \rightarrow a_1}(x_2)\nu_{3 \rightarrow a_1}(x_3)\cdots \nu_{\tau \rightarrow a_1}(x_\tau)

The above equations for the update rules for the belief propagation algorithm. At iteration t:

  1. In the first step, the variable vertices send messages to their neighboring constraints according to:\nu_{i \rightarrow a}^{(t)}(x_i) = \prod_{b \in \partial i \backslash a} \nu_{b \rightarrow i}^{(t-1)}(x_i)
  2. In the next step the constraint variables send messages to the neighboring variable vertices as:\nu_{a \rightarrow i}^{(t)}(x_i) = \sum_{x_{\partial a \backslash i}} \psi(x_{\partial a}) \prod_{j \in \partial a \backslash i} \nu_{j \rightarrow a}^{(t)}(x_j)

The initialization may be made to the uniform distribution. The marginals are computed as in the rule given in step 1 above, except that all neighbors of the vertex are included.

On a tree graph, it is not hard to see that the above algorithm converges to a fixed point and yields the correct marginals in \textrm{diam}(G) iterations, where \textrm{diam}(G) denotes the diameter of the graph, or the longest distance between two vertices in a graph. In our running example of Sudoku, however, the underlying graph is not a tree and, in fact, contains many small cycles. In general, the algorithm is not guaranteed to converge, except on graphs with exactly one cycle. Even then, it is not guaranteed to converge to the correct marginals. However, one may still run the above algorithm until convergence (or for some maximum number of iterations) in an attempt to “solve” the Sudoku puzzle.

I am not sure how good belief propagation actually is to solve Sudoku. There are some examples in literature where the algorithms and some generalizations have been investigated, but I have not read them very closely. It was a nice way to revisit some of the very interesting concepts I have seen a few months ago in a course my advisor, Andrea Montanari, teaches on graphical models. The course website contains a list of references on the material, among other things. Other articles of interest are this one on Wikipedia.

Starting Anew…

September 8, 2012

This blog is mainly to document much of the interesting stuff I see during my research.