I recently read this post by Tejal Bhamre on Sudoku as a graph coloring problem. Graph coloring, and thus Sudoku, can be viewed as an inference problem in graphical models. Consider a probability distribution over the different assignments in a Sudoku puzzle. The key graphical models idea is to use the graph as a structure which simplifies this probability distribution. I’ll use Sudoku as a running example:
The viewpoint in Tejal’s post was to consider each square in the Sudoku as a vertex of a graph, with edges connecting a vertex (or square) with all vertices in its row, column and $3 \times 3$ square. The vertices can take on values (or colors) in $\{1, \dots, 9\}$. The proper coloring constraint then ensures that the completed puzzle is a valid Sudoku solution. In particular, given values on some of the vertices, one would like to complete the puzzle by assigning values to all the remaining vertices.
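To make this concrete, here is a small sketch of the coloring graph (my own illustration: squares are indexed $0, \dots, 80$ in row-major order, and `sudoku_groups` is a helper name I’m introducing):

```python
from itertools import combinations

def sudoku_groups():
    """The 27 groups of a 9x9 Sudoku: 9 rows, 9 columns, 9 boxes."""
    rows = [[9 * r + c for c in range(9)] for r in range(9)]
    cols = [[9 * r + c for r in range(9)] for c in range(9)]
    boxes = [[9 * (3 * br + r) + 3 * bc + c for r in range(3) for c in range(3)]
             for br in range(3) for bc in range(3)]
    return rows + cols + boxes

# The coloring graph: one edge per pair of squares sharing a row, column or box.
edges = {frozenset(p) for g in sudoku_groups() for p in combinations(g, 2)}
print(len(edges))  # 810 edges; each square has 20 neighbors
```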
One can look at this in a slightly different, but essentially related, way. I’ll associate with the Sudoku puzzle a bipartite graph $G = (V, F, E)$. The vertices in $V$ correspond to the squares (of which there are $81$). The vertices in $F$ correspond to constraints: there is a vertex in $F$ for each row, column and $3 \times 3$ square. A vertex $i \in V$ and a constraint $a \in F$ are connected by an edge if $a$ constrains $i$. Finally, each vertex in $V$ takes on a (discrete) value $x_i \in \{1, \dots, 9\}$. The interesting part is the following:
- One considers a probability distribution over the configurations of values on the vertices.
- This distribution is a product of factors, each involving only the neighbors of a single constraint vertex.
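In code, the bipartite graph is little more than an incidence structure between the 81 squares and the 27 groups. A sketch of my own, reusing the `sudoku_groups` helper from above:

```python
groups = sudoku_groups()
V = list(range(81))           # variable vertices: the squares
F = list(range(len(groups)))  # constraint vertices: rows, columns, boxes

# The neighborhood of a constraint a is just its group of squares; each
# square sits in exactly three constraints (its row, column and box).
constraint_nbrs = {a: groups[a] for a in F}
variable_nbrs = {i: [a for a in F if i in groups[a]] for i in V}
assert all(len(variable_nbrs[i]) == 3 for i in V)
```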
Formally, let $\underline{x} = (x_i)_{i \in V}$ denote the vector of values assigned to the vertices in $V$. We then consider a probability distribution:

$$\mu(\underline{x}) = \frac{1}{Z} \prod_{a \in F} \psi_a(\underline{x}_{\partial a}).$$

Here $Z$ is an appropriate normalizing constant and $\underline{x}_{\partial a}$ denotes the restriction of $\underline{x}$ to the neighborhood $\partial a$ of the constraint $a$. We let $\underline{X}$ denote the random vector drawn from such a distribution. For the Sudoku puzzle, we would like $\mu(\underline{x})$ to be positive only if all constraints are satisfied. We therefore may choose:

$$\psi_a(\underline{x}_{\partial a}) = \mathbb{I}\{x_i \neq x_j \text{ for all distinct } i, j \in \partial a\}.$$

The distribution then takes the value $1/Z$ at each valid Sudoku solution. The normalizing constant $Z$ is also called the ‘partition function’ and, here, counts the number of valid solutions.
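To see the partition function in action without enumerating the full $9 \times 9$ problem, one can count the solutions of the smaller $4 \times 4$ (“Shidoku”) variant by backtracking. This is a sketch of my own; `shidoku_groups` and `count_solutions` are illustrative helper names:

```python
def shidoku_groups():
    """The 12 groups of a 4x4 Sudoku: 4 rows, 4 columns, 4 boxes."""
    rows = [[4 * r + c for c in range(4)] for r in range(4)]
    cols = [[4 * r + c for r in range(4)] for c in range(4)]
    boxes = [[4 * (2 * br + r) + 2 * bc + c for r in range(2) for c in range(2)]
             for br in range(2) for bc in range(2)]
    return rows + cols + boxes

def count_solutions(grid, groups, cell=0):
    """Count completions of a partial grid (0 = empty) by backtracking."""
    if cell == len(grid):
        return 1
    if grid[cell]:
        return count_solutions(grid, groups, cell + 1)
    total = 0
    for value in range(1, 5):
        # psi_a as an indicator: value must differ from all filled neighbors.
        if all(grid[j] != value for g in groups if cell in g for j in g):
            grid[cell] = value
            total += count_solutions(grid, groups, cell + 1)
            grid[cell] = 0
    return total

print(count_solutions([0] * 16, shidoku_groups()))  # Z for the empty puzzle: 288
```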
How does this elaborate framework help us to solve the Sudoku puzzle? One way is to use the above distribution and sample from it using the law of total probability. Let $S$ denote the set of vertices where the values are known. Wlog, let $S = \{1, \dots, m\}$ and $V \setminus S = \{m+1, \dots, n\}$. Then we might obtain a solution by sampling from $\mu(x_{m+1}, \dots, x_n \mid x_1, \dots, x_m)$, which can be done sequentially as:

$$\mathbb{P}(X_{m+1} = x_{m+1}, \dots, X_n = x_n \mid x_1, \dots, x_m) = \prod_{i=m+1}^{n} \mathbb{P}(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}).$$
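On the $4 \times 4$ variant this sequential scheme can actually be carried out, if very expensively, by computing each conditional through brute-force counting: $\mathbb{P}(X_i = v \mid \text{values so far})$ is proportional to the number of solutions extending the current partial assignment with $X_i = v$. A sketch of my own, reusing `count_solutions` and `shidoku_groups` from above:

```python
import random

def sample_completion(known, groups, n_cells=16, n_values=4):
    """Sample a uniformly random completion of the given squares."""
    grid = [0] * n_cells
    for j, v in known.items():
        grid[j] = v
    for i in range(n_cells):
        if grid[i]:
            continue
        counts = []
        for v in range(1, n_values + 1):
            if any(grid[j] == v for g in groups if i in g for j in g):
                counts.append(0)  # v conflicts with an already-filled square
            else:
                grid[i] = v
                counts.append(count_solutions(grid, groups))
                grid[i] = 0
        # Chain rule: sample X_i with probability proportional to the counts.
        grid[i] = random.choices(range(1, n_values + 1), weights=counts)[0]
    return grid

print(sample_completion({0: 1, 5: 3}, shidoku_groups()))
```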
The problem with this scheme is that we do not know the distribution completely, since we do not know how to compute the partition function efficiently (of course, one way would be to enumerate all the solutions, but that is what we are trying to avoid!). Thus, we cannot compute the marginal probabilities as above. This is where the graph structure can help: although we cannot compute $Z$, we can compute certain conditional distributions fairly easily. For instance, if one knew all but one square in a row, it would be easy to predict the value of that square. In general, inference tasks can be classified as one of the following:
- Sampling: to give a sample $\underline{X} \sim \mu$.
- Compute marginals: for a subset of vertices $S \subseteq V$, compute $\mu(\underline{x}_S)$.
- Compute conditional probabilities: for subsets $S, T \subseteq V$, compute $\mu(\underline{x}_S \mid \underline{x}_T)$.
- Compute the partition function $Z$.
Reductions between these tasks have been studied in the MCMC literature (see this paper or this one).
If the underlying graph is a tree, then these inference tasks become easy and can be solved in time linear in the number of variables (assuming bounded constraint degrees and a bounded number of values per variable) using an algorithm known as belief propagation.
Let $G$ be a tree rooted at the first vertex and suppose we want to find the marginal at the root, i.e. $\mu(x_1)$. Throughout below, I will ignore the normalization involved and use the symbol $\propto$ to indicate this.
Let the children of $1$ be denoted by $a_1, \dots, a_k$ (these are constraint vertices) and the subtrees rooted at them by $G_{a_1}, G_{a_2}$, etc. Then, using the distributive property:

$$\mu(x_1) \propto \sum_{\underline{x}_{V \setminus \{1\}}} \prod_{a \in F} \psi_a(\underline{x}_{\partial a}) = \prod_{r=1}^{k} \bigg\{ \sum_{\underline{x}_{G_{a_r} \setminus \{1\}}} \prod_{b \in F \cap G_{a_r}} \psi_b(\underline{x}_{\partial b}) \bigg\} \equiv \prod_{r=1}^{k} \hat{\nu}_{a_r \to 1}(x_1).$$

Each term in the braces is a “marginal” of the root with respect to the subgraph rooted at a child, and is denoted $\hat{\nu}_{a_r \to 1}(x_1)$ as in the second line. Consider any single child $a$ and suppose its children are, wlog, the variable vertices $2, \dots, \ell$. We recursively rewrite the marginal as:

$$\hat{\nu}_{a \to 1}(x_1) \propto \sum_{x_2, \dots, x_\ell} \psi_a(x_1, x_2, \dots, x_\ell) \prod_{j=2}^{\ell} \nu_{j \to a}(x_j), \qquad \text{where } \nu_{j \to a}(x_j) \propto \prod_{b \in \partial j \setminus a} \hat{\nu}_{b \to j}(x_j)$$

is the analogous “marginal” of $x_j$ with respect to the subtree hanging from $j$ away from $a$.
The above equations form the update rules for the belief propagation algorithm. At iteration $t$:
- In the first step, the variable vertices send messages to their neighboring constraints according to:
$$\nu^{t}_{i \to a}(x_i) \propto \prod_{b \in \partial i \setminus a} \hat{\nu}^{t-1}_{b \to i}(x_i).$$
- In the next step, the constraint vertices send messages to the neighboring variable vertices as:
$$\hat{\nu}^{t}_{a \to i}(x_i) \propto \sum_{\underline{x}_{\partial a \setminus i}} \psi_a(\underline{x}_{\partial a}) \prod_{j \in \partial a \setminus i} \nu^{t}_{j \to a}(x_j).$$
The initialization may be made to the uniform distribution. The marginals are computed as in the rule given in step 1 above, except that all neighbors of the vertex are included: $\nu^{t}_{i}(x_i) \propto \prod_{b \in \partial i} \hat{\nu}^{t-1}_{b \to i}(x_i)$.
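Here is a sketch of these updates in code (my own illustration, not tuned for performance). Messages are stored per directed edge as length-$q$ probability vectors, and known squares are clamped through a degree-one evidence factor, which is one standard way to condition on the givens. The constraint-to-variable step below sums naively over all assignments of the other neighbors, which is affordable for the $4 \times 4$ variant but would need the structure of the all-different constraint to be exploited for $9 \times 9$:

```python
from itertools import product

def normalize(vec):
    s = sum(vec)
    return [x / s for x in vec] if s > 0 else [1.0 / len(vec)] * len(vec)

def belief_propagation(groups, n_cells, q, known, iters=50):
    var_nbrs = {i: [a for a, g in enumerate(groups) if i in g]
                for i in range(n_cells)}

    def prior(i):
        # Degree-one evidence factor: a point mass on known squares,
        # all ones (uniform after normalization) otherwise.
        return [1.0 if known.get(i, v + 1) == v + 1 else 0.0 for v in range(q)]

    # Initialize all messages to the uniform distribution (clamping givens).
    nu = {(i, a): normalize(prior(i))
          for i in range(n_cells) for a in var_nbrs[i]}
    nu_hat = {(a, i): [1.0 / q] * q
              for a, g in enumerate(groups) for i in g}

    for _ in range(iters):
        # Step 1: variable -> constraint messages.
        for (i, a) in nu:
            msg = prior(i)
            for b in var_nbrs[i]:
                if b != a:
                    msg = [m * h for m, h in zip(msg, nu_hat[(b, i)])]
            nu[(i, a)] = normalize(msg)
        # Step 2: constraint -> variable messages, with psi_a the
        # all-different indicator over the constraint's neighborhood.
        for (a, i) in nu_hat:
            others = [j for j in groups[a] if j != i]
            msg = [0.0] * q
            for xi in range(q):
                for xs in product(range(q), repeat=len(others)):
                    if len(set((xi,) + xs)) == len(others) + 1:  # all distinct
                        p = 1.0
                        for j, xj in zip(others, xs):
                            p *= nu[(j, a)][xj]
                        msg[xi] += p
            nu_hat[(a, i)] = normalize(msg)

    # Marginals: as in step 1, but over *all* neighboring constraints.
    marginals = {}
    for i in range(n_cells):
        m = prior(i)
        for b in var_nbrs[i]:
            m = [x * h for x, h in zip(m, nu_hat[(b, i)])]
        marginals[i] = normalize(m)
    return marginals
```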
On a tree graph, it is not hard to see that the above algorithm converges to a fixed point and yields the correct marginals in $O(\mathrm{diam}(G))$ iterations, where $\mathrm{diam}(G)$ denotes the diameter of the graph, i.e. the longest distance between two vertices. In our running example of Sudoku, however, the underlying graph is not a tree and, in fact, contains many small cycles. In general, the algorithm is not guaranteed to converge, except on graphs with exactly one cycle. Even then, it is not guaranteed to converge to the correct marginals. However, one may still run the above algorithm until convergence (or for some maximum number of iterations) in an attempt to “solve” the Sudoku puzzle.
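For instance, on the $4 \times 4$ variant (reusing `shidoku_groups` and `belief_propagation` from above, with an illustrative set of givens) one might read off the argmax of each marginal as a heuristic completion; a more careful approach would fix the most confident square, re-run, and repeat, a strategy known as decimation:

```python
known = {0: 1, 5: 3, 10: 2, 15: 4}
marg = belief_propagation(shidoku_groups(), n_cells=16, q=4, known=known)
completion = [max(range(4), key=lambda v: marg[i][v]) + 1 for i in range(16)]
print(completion)  # candidate solution; worth checking against the constraints
```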
I am not sure how good belief propagation actually is at solving Sudoku. There are some examples in the literature where the algorithm and some generalizations have been investigated, but I have not read them very closely. This was a nice way to revisit some of the very interesting concepts I saw a few months ago in a course my advisor, Andrea Montanari, teaches on graphical models. The course website contains a list of references on the material, among other things. Another article of interest is this one on Wikipedia.