“Solving” Sudoku using Belief Propagation

I recently read this post by Tejal Bhamre on Sudoku as a graph coloring problem. Graph coloring, and thus Sudoku, can be viewed as an inference problem in graphical models. Consider a probability distribution over the different assignments in a Sudoku puzzle. The key graphical models idea is to use the graph as a structure which simplifies this probability distribution. I’ll use Sudoku as a running example:

The viewpoint in Tejal’s post was to consider each square in the Sudoku as a vertex of a graph G with edges connecting a vertex (or square) with all vertices  in it’s row, column and 3 \times 3 square. The vertices can take on values (or colors) 1, 2, \cdots 9. The proper coloring constraint then ensures that the completed puzzle is a valid Sudoku solution. In particular, given values on the various vertices one would like to complete the puzzle by assigning values to all vertices.

One can look at this in a slightly different, but essentially related, way. I’ll associate with the Sudoku puzzle, a bipartite graph G(V, F, E). The vertices in V correspond to the squares (of which there are 3^4 = 81). The vertices in F correspond to constraints: there is a vertex in F for each row, column and 3 \times 3 square. A vertex v \in V and a constraint a \in F are connected by an edge if a constrains v. Finally, each vertex in V takes on a (discrete) value x_v \in \mathcal{X}. The interesting part is the following:

  1. One considers a probability distribution over the configurations of values on the vertices.
  2. This distribution is a product of factors each involving neighbors of a single constraint variable.

Formally, let x denote the vector of values assigned to each vertex in V. We then consider a probability distribution:

\mu (x) = \frac{1}{Z} \prod_{a \in F} \psi(x_{\partial a})

Here Z is an appropriate normalizing constant and x_{\partial a} denotes the restriction of x to the neighborhood \partial a of the constraint a.  We let X denote the random vector drawn from such a distribution. For the Sudoku puzzle, we would like \mu (x) to be positive only if all constraints are satisfied. We therefore may choose:

\psi(x_{\partial a}) = \mathbb{I}(\text{entries in } x_{\partial a} \text{ are different})

The distribution then takes the value \frac{1}{Z} at each valid Sudoku solution. The normalizing constant Z is also called the ‘partition function’ and, here, counts the number of valid solutions.

How does this elaborate framework help us to solve the Sudoku puzzle? One way is to use the above distribution and sample from it using the law of total probability. Let A denote the set of vertices where the values are known. Wlog, let V = \{1, 2, \cdots n\} := [n] and V\backslash A = [k]. Then we might obtain a solution by sampling from \mu(x_{V\backslash A} \vert x_A) which can be done sequentially as:

\mu(x_{V\backslash A} \vert x_A) = \mu(x_1\vert x_A) \mu(x_2 \vert x_1, x_A) \cdots \mu(x_k \vert x_1, x_2, \cdots x_{k-1}, x_A)

The problem in this scheme is that we do not know the distribution \mu completely, since we do not know how to compute the partition function efficiently (of course, one way would be to enumerate all the solutions, but that is what we are trying to avoid!).  Thus, we cannot compute the marginal probabilities as above. This is where the graph structure can help, since although we cannot compute \mu,  we can compute certain conditional distributions fairly easily. For instance, suppose one knew all but one square in a row, then it is easy to predict the value of that square. In general, inference tasks can be classified as one of the following:

  1. Sampling: to give a sample X \sim \mu
  2. Compute marginals: for a subset of vertices A, compute \mu(x_A) = \text{Prob}(X_A = x_A)
  3. Compute conditional probabilities: for subsets A and B, compute \mu(x_A \vert x_B) = \text{Prob}(X_A = x_A \vert X_B = x_B)
  4. Compute the partition function Z

Reductions between these tasks have been studied in MCMC literature (see this paper or this one).

If the underlying graph G(V, F, E) is a tree, then these inference tasks become easy and can be solved in time that is linear in the number of variables (assuming bounded degree constraints and bounded \vert \mathcal{X} \vert) using an algorithm known as belief propagation.

Let G be a tree rooted at the first vertex and suppose we want to find the marginal at the root, i.e. \mu (x_1). Throughout below, I will ignore the normalization involved and use the \cong to indicate this.

\mu(x_1) \cong \sum_{x_{V \backslash 1}}\prod_{a \in F} \psi(x_{\partial a})

Let the children of 1 be denoted by a_1, a_2 \cdots a_\Delta and the subtrees rooted at them by G_{a_1 \rightarrow 1} etc. Then using the distributive property x\cdot y + x\cdot z = x \cdot(y+z)  :

\mu(x_1) \cong \left[ \sum_{x_{G_{a_1 \rightarrow 1}}} \prod_{b \in F_{a_1 \rightarrow 1}} \psi(x_{\partial b}) \right]\left[ \sum_{x_{G_{a_2 \rightarrow 1}}} \prod_{b \in F_{a_2 \rightarrow 1}} \psi(x_{\partial b}) \right]\cdots \\  . \qquad\cong \nu_{a_1 \rightarrow 1}(x_1) \nu_{a_2\rightarrow 1}(x_1) \cdots \nu_{a_\Delta \rightarrow 1}(x_1)

Each term in the braces is a “marginal” of the root with respect to the subgraph rooted at a child and is denoted as \nu_{a \rightarrow 1} in the second line. Consider any single child a_1 and suppose its children are, wlog, \{2, 3, \cdots \tau\}. We recursively rewrite the marginal as:

\nu_{a_1 \rightarrow 1}(x_1) \cong \sum_{x_{G_{a_1 \rightarrow 1}}} \prod_{b \in F_{a_1 \rightarrow 1}} \psi(x_{\partial b})\\  . \quad \cong \sum_{x_{\partial a_1 \backslash 1}}\psi(x_{\partial a_1}) \left[ \sum_{x_{G_{2 \rightarrow a_1}}} \prod_{b \in F_{2 \rightarrow a_1}}\psi(x_{\partial b})\right]\left[ \sum_{x_{G_{3 \rightarrow a_1}}} \prod_{b \in F_{3 \rightarrow a_1}}\psi(x_{\partial b})\right] \cdots \\  .\quad \cong \sum_{x_{\partial a_1 \backslash 1}}\psi(x_{\partial a_1})\nu_{2 \rightarrow a_1}(x_2)\nu_{3 \rightarrow a_1}(x_3)\cdots \nu_{\tau \rightarrow a_1}(x_\tau)

The above equations for the update rules for the belief propagation algorithm. At iteration t:

  1. In the first step, the variable vertices send messages to their neighboring constraints according to:\nu_{i \rightarrow a}^{(t)}(x_i) = \prod_{b \in \partial i \backslash a} \nu_{b \rightarrow i}^{(t-1)}(x_i)
  2. In the next step the constraint variables send messages to the neighboring variable vertices as:\nu_{a \rightarrow i}^{(t)}(x_i) = \sum_{x_{\partial a \backslash i}} \psi(x_{\partial a}) \prod_{j \in \partial a \backslash i} \nu_{j \rightarrow a}^{(t)}(x_j)

The initialization may be made to the uniform distribution. The marginals are computed as in the rule given in step 1 above, except that all neighbors of the vertex are included.

On a tree graph, it is not hard to see that the above algorithm converges to a fixed point and yields the correct marginals in \textrm{diam}(G) iterations, where \textrm{diam}(G) denotes the diameter of the graph, or the longest distance between two vertices in a graph. In our running example of Sudoku, however, the underlying graph is not a tree and, in fact, contains many small cycles. In general, the algorithm is not guaranteed to converge, except on graphs with exactly one cycle. Even then, it is not guaranteed to converge to the correct marginals. However, one may still run the above algorithm until convergence (or for some maximum number of iterations) in an attempt to “solve” the Sudoku puzzle.

I am not sure how good belief propagation actually is to solve Sudoku. There are some examples in literature where the algorithms and some generalizations have been investigated, but I have not read them very closely. It was a nice way to revisit some of the very interesting concepts I have seen a few months ago in a course my advisor, Andrea Montanari, teaches on graphical models. The course website contains a list of references on the material, among other things. Other articles of interest are this one on Wikipedia.


2 Responses to ““Solving” Sudoku using Belief Propagation”

  1. Igor Says:

    Hello Yash,

    Did you implement that algorithm ?


    • Yash Deshpande Says:

      Hi Igor,

      No. I have not had the chance to implement the algorithm and see how it works. I recall coming across papers which did so, but have not surveyed them seriously enough to comment on them.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: