The way to obtain this is pretty standard. For instance, writing $\|X\|^2 = \frac{1}{d}\sum_{i=1}^d Z_i^2$ with the $Z_i$ i.i.d. standard normal, the first inequality can be obtained as:

$$\mathbb{P}\left(\|X\|^2 \geq 1+\epsilon\right) \leq e^{-\lambda d(1+\epsilon)}\,\mathbb{E}\left[e^{\lambda \sum_i Z_i^2}\right] = e^{-\lambda d(1+\epsilon)}\,(1-2\lambda)^{-d/2},$$

where $0 < \lambda < 1/2$ and the inequality is by Markov. Optimizing over $\lambda$ yields the result. This tells us that $\|X\|$ concentrates around its expected value of 1 at least to the extent of $c/\sqrt{d}$ for a small constant $c$. In fact one can do better than this using an exact large deviations calculation, which I will probably leave to another post.
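As a quick numerical sanity check (a simulation sketch of my own, not part of the argument), one can sample vectors with i.i.d. $N(0, 1/d)$ entries and watch the norm tighten around 1 as $d$ grows:

```python
import numpy as np

# Empirical check: for X with i.i.d. N(0, 1/d) entries, ||X|| concentrates
# around 1 in a window of width O(1/sqrt(d)).
rng = np.random.default_rng(0)
n_samples = 10000

means, stds = [], []
for d in (10, 100, 1000):
    X = rng.standard_normal((n_samples, d)) / np.sqrt(d)
    R = np.linalg.norm(X, axis=1)
    means.append(R.mean())
    stds.append(R.std())
    print(f"d={d:5d}  mean={R.mean():.4f}  std={R.std():.4f}")
```

The empirical standard deviations should shrink roughly like $1/\sqrt{2d}$, consistent with a shell of thickness $c/\sqrt{d}$.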

Thus for large dimension $d$, the probability mass of the distribution is concentrated in a shell of thickness $O(1/\sqrt{d})$ around mean radius 1, which, intuitively, looks very much like the uniform measure on the sphere $S^{d-1}$. The question that interested me is how we can prove the converse: assuming $X \sim \sigma^{d-1}$, where $\sigma^{d-1}$ denotes the uniform spherical measure on $S^{d-1}$, do the marginals of each entry, say $\sqrt{d}\,X_1$, look Gaussian? More precisely, let $\mu_d$ denote the law of $\sqrt{d}\,X_1$ on $\mathbb{R}$. Then, does $\mu_d$ converge weakly to the standard Gaussian measure? This is, in fact, true and is an old result due to Borel in his treatise “Introduction géométrique à quelques théories physiques”. I will sketch a simple argument via characteristic functions.
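The claim is easy to test numerically before proving it (a sketch of my own; the sampling trick of normalizing a Gaussian vector to get a uniform point on the sphere is standard):

```python
import math
import numpy as np

# Sample uniform points on S^{d-1} by normalizing i.i.d. Gaussian vectors,
# then check that sqrt(d) * X_1 behaves like a standard normal.
rng = np.random.default_rng(1)
d, n_samples = 200, 50000

G = rng.standard_normal((n_samples, d))
X = G / np.linalg.norm(G, axis=1, keepdims=True)   # uniform on the sphere
marginal = math.sqrt(d) * X[:, 0]

emp_cdf = (marginal <= 0.5).mean()
gauss_cdf = 0.5 * (1 + math.erf(0.5 / math.sqrt(2)))   # Phi(0.5)
print(marginal.mean(), marginal.var(), emp_cdf, gauss_cdf)
```

Note that $\mathbb{E}[(\sqrt{d}X_1)^2] = 1$ exactly by symmetry, so only the shape of the distribution is at stake here.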

The function $\phi_d(t) = \mathbb{E}\left[e^{it\sqrt{d}X_1}\right]$ is the characteristic function of $\mu_d$, where $X \sim \sigma^{d-1}$. We can compute this integral explicitly. The formula for the “area” of a $d$-dimensional spherical cap as given (here) is:

$$A_d^{\mathrm{cap}}(h) = \frac{1}{2} A_d\, I_{2h - h^2}\!\left(\frac{d-1}{2}, \frac{1}{2}\right),$$

where $A_d$ is the surface area of $S^{d-1}$, $h$ is the depth of the cap and $I_x(a,b)$ is the normalized incomplete beta function. Differentiating this, dividing by $A_d$ and noting that a point at depth $h$ has first coordinate $x = 1 - h$, we can compute the characteristic function as the following integral:

$$\phi_d(t) = \frac{1}{B\left(\frac{d-1}{2}, \frac{1}{2}\right)} \int_{-1}^{1} e^{it\sqrt{d}\,x}\,(1-x^2)^{\frac{d-3}{2}}\, dx,$$

where $B(a,b)$ is the beta function. To do the above, we have to take care of a few sign conventions, which I have omitted. Now, for fixed $t$ and large $d$, $B\left(\frac{d-1}{2}, \frac{1}{2}\right) \sim \sqrt{2\pi/d}$ by Stirling’s approximation. With this and the change of variables $s = \sqrt{d}\,x$, we obtain:

$$\lim_{d\to\infty} \phi_d(t) = \lim_{d\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-\sqrt{d}}^{\sqrt{d}} e^{its}\left(1 - \frac{s^2}{d}\right)^{\frac{d-3}{2}} ds = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{its}\, e^{-s^2/2}\, ds,$$

by dominated convergence. The last limit is precisely what we require, namely the characteristic function $e^{-t^2/2}$ of the standard Gaussian measure. The rest follows from Lévy’s continuity theorem.
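The limit can also be checked numerically. The sketch below (my own; it assumes the single-coordinate density $f(x) \propto (1-x^2)^{(d-3)/2}$ on $[-1,1]$ used above) evaluates $\phi_d(t)$ by quadrature and compares it with $e^{-t^2/2}$:

```python
import math
import numpy as np

# Numerically evaluate phi_d(t) = E[exp(i t sqrt(d) X_1)] for X uniform on
# S^{d-1}, using the density f(x) = (1-x^2)^((d-3)/2) / B(1/2, (d-1)/2).
d, t = 400, 1.0
log_beta = math.lgamma(0.5) + math.lgamma((d - 1) / 2) - math.lgamma(d / 2)

grid = np.linspace(-1.0, 1.0, 100001)
x = 0.5 * (grid[:-1] + grid[1:])       # midpoints, avoiding x = +/- 1
dx = grid[1] - grid[0]
density = np.exp(0.5 * (d - 3) * np.log1p(-x * x) - log_beta)

# by symmetry the imaginary part vanishes, so only cos survives
phi = np.sum(density * np.cos(t * math.sqrt(d) * x)) * dx
target = math.exp(-t * t / 2)
print(phi, target)
```

For $d = 400$ the two values should already agree to about three decimal places, reflecting the $O(1/d)$ rate of the convergence.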

There are significant improvements to this result. Diaconis and Freedman prove that this holds for all low-dimensional marginals, even in the sense of variation distance. More precisely, let $X_{(k)}$ denote the restriction of $X$ to the first $k$ entries. They give a sharp $O(k/d)$ bound on the variation distance between the law of $\sqrt{d}\,X_{(k)}$ and the standard Gaussian measure on $\mathbb{R}^k$. Interesting generalizations include this result by D’Aristotile, Diaconis and Newman, which says that when $O$ is an orthogonal matrix of size $n$ chosen uniformly at random, suitably normalized linear functionals of $O$ converge to a standard normal, and others which state that even a “small” submatrix of $O$ behaves as though its entries were i.i.d. standard normals.

I am sure I have barely scratched the surface here, but it is clear that many (and sometimes intuitively plausible) approximations of this kind are actually fairly accurate even in a rigorous sense.

UPDATE (04/05/13): Thanks to Jiantao Jiao for some corrections above.

By the pigeonhole principle one would have to collect at least $n$ coupons, but this is not enough. With high probability one would get at least one overlap and thus, at least one bin would be empty. If $T$ is the (random) number of coupons required to cover all $n$ types, then the task is to find $\mathbb{E}[T]$.

It is useful to approach this in an inductive fashion: assume that $k$ types have already been collected. Since each draw is uniform over the $n$ types, the number of selections required to get a new type is a geometric random variable with success probability $(n-k)/n$. Thus, in expectation, one needs $n/(n-k)$ tries to get a new type after collecting $k$ types. This yields that, starting from nothing, one needs

$$\mathbb{E}[T] = \sum_{k=0}^{n-1} \frac{n}{n-k} = n H_n$$

tries in expectation, where $H_n$ denotes the harmonic sum up to $n$. This number is roughly $n \log n$, yielding that $\mathbb{E}[T] \approx n \log n$.
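A short simulation (a sketch of mine, not part of the original argument) confirms the $nH_n$ answer:

```python
import numpy as np

# Simulate the coupon collector: draw uniformly from n types until all are
# seen, and compare the average number of draws with n * H_n.
rng = np.random.default_rng(0)
n, trials = 50, 1000

counts = []
for _ in range(trials):
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(int(rng.integers(n)))
        draws += 1
    counts.append(draws)

empirical = sum(counts) / trials
H_n = sum(1.0 / k for k in range(1, n + 1))
print(empirical, n * H_n)
```

For $n = 50$ the target $nH_n$ is about $225$, noticeably larger than the pigeonhole bound of $50$.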

More interesting forms of “coupon-collecting” type questions are, say, how many tries are needed to obtain a coupon of each type with high probability (and not just in expectation). Or, assuming one collected coupons u.a.r. as before, what would be the rough size of the largest collection of any single type? It is interesting how relatively simple problems arise in unexpected places. I started this post after noting that the same “coupon collector” effect yields a lower bound of order $n \log n$ on the number of randomly sampled entries needed to recover a low-rank matrix of size $n \times n$ (as proved in Candès and Tao’s paper on low-rank matrix recovery).


The viewpoint in Tejal’s post was to consider each square of the Sudoku as a vertex of a graph, with edges connecting a vertex (or square) to all vertices in its row, column and $3 \times 3$ sub-square. The vertices can take on values (or colors) in $\{1, \ldots, 9\}$. The proper coloring constraint then ensures that the completed puzzle is a valid Sudoku solution. In particular, given values on some of the vertices, one would like to complete the puzzle by assigning values to all vertices.

One can look at this in a slightly different, but essentially related, way. I’ll associate with the Sudoku puzzle a bipartite graph $G = (V, F, E)$. The vertices in $V$ correspond to the squares (of which there are 81). The vertices in $F$ correspond to constraints: there is a vertex in $F$ for each row, column and sub-square (27 constraints in all). A vertex $i \in V$ and a constraint $a \in F$ are connected by an edge if $a$ constrains $i$. Finally, each vertex in $V$ takes on a (discrete) value $x_i \in \{1, \ldots, 9\}$. The interesting part is the following:

- One considers a probability distribution over the configurations of values on the vertices.
- This distribution is a product of factors, each involving the neighbors of a *single* constraint vertex.

Formally, let $x = (x_1, \ldots, x_{81})$ denote the vector of values assigned to the vertices in $V$. We then consider a probability distribution:

$$\mu(x) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{\partial a}).$$

Here $Z$ is an appropriate normalizing constant and $x_{\partial a}$ denotes the restriction of $x$ to the neighborhood of the constraint $a$. We let $X$ denote the random vector drawn from this distribution. For the Sudoku puzzle, we would like $\mu(x)$ to be positive only if all constraints are satisfied. We therefore may choose:

$$\psi_a(x_{\partial a}) = \mathbb{I}\left\{ \text{the entries of } x_{\partial a} \text{ form a permutation of } 1, \ldots, 9 \right\}.$$

The distribution then takes the value $1/Z$ at each valid Sudoku solution. The normalizing constant $Z$ is also called the ‘partition function’ and, here, counts the number of valid solutions.
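For the full $9 \times 9$ puzzle this $Z$ is astronomically large, but for the $4 \times 4$ variant (sometimes called “Shidoku”) the partition function can be computed by brute force, which makes the counting interpretation concrete. This sketch (my own) enumerates grids row by row, with the row constraints built in and the column and block constraints checked explicitly:

```python
from itertools import permutations, product

# Partition function of the 4x4 Sudoku ("Shidoku") factor graph: Z counts
# the assignments where every row, column and 2x2 block is a permutation.
rows = list(permutations((1, 2, 3, 4)))   # row constraints built in

def satisfies_constraints(grid):
    for c in range(4):                    # column constraints
        if len({grid[r][c] for r in range(4)}) != 4:
            return False
    for br in (0, 2):                     # 2x2 block constraints
        for bc in (0, 2):
            block = {grid[br + i][bc + j] for i in range(2) for j in range(2)}
            if len(block) != 4:
                return False
    return True

Z = sum(1 for grid in product(rows, repeat=4) if satisfies_constraints(grid))
print(Z)   # the number of valid 4x4 grids
```

This enumerates only $24^4 = 331{,}776$ candidate grids, so it runs in seconds; the analogous enumeration for $9 \times 9$ is hopeless, which is exactly the point of the discussion that follows.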

How does this elaborate framework help us solve the Sudoku puzzle? One way is to use the above distribution and sample from it using the law of total probability. Let $S \subseteq V$ denote the set of vertices where the values are known. W.l.o.g., let $S = \{1, \ldots, m\}$ and write $x_S = (x_1, \ldots, x_m)$. Then we might obtain a solution by sampling from $\mu(x_{m+1}, \ldots, x_{81} \mid x_S)$, which can be done sequentially as:

$$x_{m+1} \sim \mu(\,\cdot \mid x_S), \quad x_{m+2} \sim \mu(\,\cdot \mid x_S, x_{m+1}), \quad \ldots, \quad x_{81} \sim \mu(\,\cdot \mid x_S, x_{m+1}, \ldots, x_{80}).$$

The problem with this scheme is that we do not know the distribution $\mu$ completely, since we do not know how to compute the partition function $Z$ efficiently (of course, one way would be to enumerate all the solutions, but that is exactly what we are trying to avoid!). Thus, we cannot compute the marginal probabilities as above. This is where the graph structure can help: although we cannot compute $Z$, we can compute certain conditional distributions fairly easily. For instance, if one knew all but one square in a row, it would be easy to predict the value of that square. In general, inference tasks can be classified as one of the following:

- Sampling: generate a sample $X \sim \mu$
- Compute marginals: for a subset of vertices $S$, compute $\mu(x_S)$
- Compute conditional probabilities: for subsets $S$ and $T$, compute $\mu(x_S \mid x_T)$
- Compute the partition function $Z$

Reductions between these tasks have been studied in the MCMC literature (see this paper or this one).

If the underlying graph is a tree, then these inference tasks become easy and can be solved in time that is linear in the number of variables (assuming bounded-degree constraints and a bounded alphabet of values) using an algorithm known as belief propagation.

Let $G$ be a tree rooted at the first vertex and suppose we want to find the marginal at the root, i.e. $\mu(x_1)$. Throughout below, I will ignore the normalization involved and use the symbol $\propto$ to indicate this.

Let the children of 1 (which are constraint vertices) be denoted by $a, b, c, \ldots$ and the subtrees rooted at them by $T_a, T_b$, etc. Then, using the distributive property:

$$\mu(x_1) \propto \prod_{a \in \partial 1} \left\{ \sum_{x_{T_a}} \prod_{b \in F \cap T_a} \psi_b(x_{\partial b}) \right\} = \prod_{a \in \partial 1} \hat{\nu}_{a \to 1}(x_1).$$

Each term in the braces is a “marginal” of the root with respect to the subtree rooted at a child, and is denoted $\hat{\nu}_{a \to 1}(x_1)$ in the second line. Consider any single child $a$ and suppose its children (variable vertices) are, w.l.o.g., $2, \ldots, l$. We recursively rewrite the marginal as:

$$\hat{\nu}_{a \to 1}(x_1) \propto \sum_{x_2, \ldots, x_l} \psi_a(x_{\partial a}) \prod_{j=2}^{l} \nu_{j \to a}(x_j), \qquad \nu_{j \to a}(x_j) \propto \prod_{b \in \partial j \setminus a} \hat{\nu}_{b \to j}(x_j).$$

The above equations yield the update rules for the belief propagation algorithm. At iteration $t$:

- In the first step, the variable vertices send messages to their neighboring constraints according to: $\nu^{t}_{i \to a}(x_i) \propto \prod_{b \in \partial i \setminus a} \hat{\nu}^{t-1}_{b \to i}(x_i)$
- In the next step, the constraint vertices send messages to the neighboring variable vertices as: $\hat{\nu}^{t}_{a \to i}(x_i) \propto \sum_{x_{\partial a \setminus i}} \psi_a(x_{\partial a}) \prod_{j \in \partial a \setminus i} \nu^{t}_{j \to a}(x_j)$

The initialization may be taken to be the uniform distribution. The marginals are then computed as in the rule given in step 1 above, except that all neighbors of the vertex are included: $\mu(x_i) \propto \prod_{a \in \partial i} \hat{\nu}_{a \to i}(x_i)$.
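To make the message passing concrete, here is a minimal sum-product sketch of my own on a toy tree model: three variables on a star, pairwise “not-equal” constraints (as in coloring), and a bias on one leaf. This is pairwise belief propagation, a special case of the factor-graph updates above; on a tree the fixed point reproduces the exact marginal, which the brute-force sum at the end verifies. All names here are mine.

```python
import numpy as np
from itertools import product

q = 3
psi = 1.0 - np.eye(q)                     # psi(a, b) = 1 iff a != b
unary = {0: np.ones(q), 1: np.array([2.0, 1.0, 1.0]), 2: np.ones(q)}
nbrs = {0: (1, 2), 1: (0,), 2: (0,)}      # a star (tree) rooted at 0

# message m[(i, j)] from vertex i to vertex j, initialized uniform
msg = {(i, j): np.ones(q) / q for i in nbrs for j in nbrs[i]}

for _ in range(5):                        # more iterations than the diameter
    new = {}
    for (i, j) in msg:
        incoming = unary[i].copy()
        for k in nbrs[i]:
            if k != j:
                incoming = incoming * msg[(k, i)]
        m = psi.T @ incoming              # sum over x_i
        new[(i, j)] = m / m.sum()
    msg = new

belief = unary[0] * msg[(1, 0)] * msg[(2, 0)]
belief = belief / belief.sum()            # BP marginal at the root

exact = np.zeros(q)                       # brute-force marginal, for comparison
for x in product(range(q), repeat=3):
    w = unary[0][x[0]] * unary[1][x[1]] * unary[2][x[2]]
    w *= psi[x[0], x[1]] * psi[x[0], x[2]]
    exact[x[0]] += w
exact = exact / exact.sum()
print(belief, exact)
```

The bias on leaf 1 pushes the root away from leaf 1’s favored value, and BP recovers exactly the marginal that exhaustive summation gives.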

On a tree graph, it is not hard to see that the above algorithm converges to a fixed point and yields the correct marginals in $O(\mathrm{diam}(G))$ iterations, where $\mathrm{diam}(G)$ denotes the diameter of the graph, i.e. the longest distance between two vertices. In our running example of Sudoku, however, the underlying graph is not a tree and, in fact, contains many short cycles. In general, the algorithm is then not guaranteed to converge, except on graphs with exactly one cycle. Even then, it is not guaranteed to converge to the correct marginals. However, one may still run the above algorithm until convergence (or for some maximum number of iterations) in an attempt to “solve” the Sudoku puzzle.

I am not sure how good belief propagation actually is at solving Sudoku. There are some examples in the literature where the algorithm and some of its generalizations have been investigated, but I have not read them very closely. This was a nice way to revisit some of the very interesting concepts I saw a few months ago in a course my advisor, Andrea Montanari, teaches on graphical models. The course website contains a list of references on the material, among other things. Another article of interest is this one on Wikipedia.
