We are going to use OpenMP to create a fast parallel Sudoku solver using two different algorithms, Crooks algorithm and SAT solving. Crooks algorithm is a sudoku specific implementation. Sudokus are known to have an equivalent mapping to sat formula, so we plan to make a sat solver that will be pipelined to solve sudoku formulas. We will show our speedup and timing statistics at the poster presentation.
Sudoku is a puzzle in which an N^2x N^2 board must be filled with N^2 numbers each of 1 through N^2 such that each row, column, and NxN box only contains each number once. Traditional Sudoku boards are 9x9. Solving a Sudoku puzzle is an NP-complete problem, meaning there is no known polynomial time solution. This makes Sudoku a great problem for parallelization. The brute force solution is an effective way to solve the puzzle because it will always find a solution. There is a lot of potential for parallelization, as there are a lot of independent regions that need to be filled. Here is an example Sudoku problem (left) and solution (right):
We represented the Sudoku puzzles as arrays. Our algorithms returned the same arrays with the missing numbers filled in. Most of the parallelization occurred in the step where we looked for possible valid numbers for each cell, as these computations were largely independent. Dependencies occured when we updated the board, since this sometimes changed the possible values for other cells.
The humanistic solver uses Crook’s algorithm to trim the number of possibilities it must search before finishing the puzzle with brute force. There are four patterns that Crook’s algorithm considers when trimming possibilities:
Our humanistic solver tries these methods in order. If the solver is able to place numbers using any of the methods, it updates the board and starts again from elimination. We have noticed that most updates come from the “elimination” and “lone rangers” steps. Finding twins and triplets is more difficult and generally takes so much time that it slows down our algorithm. Crook’s algorithm is not always guaranteed to finish the puzzle, as it might not always be possible to fill all the cells with just these four strategies. Therefore, it is necessary to resort to brute force at the end if the puzzle is not finished. The steps that our algorithm takes can be summarized with this flowchart:
We stored the board as well as a list of possibles values for each cell so we wouldn’t have to recalculate the values every time. We stored these possible values as a 2D array of ints. For each square in the original board we had one array of length N + 1 where the first entry was the number of valid numbers for the cell and the ith entry was a 1 if i was a valid entry for that cell. We experimented with twins and triplets but they required so many updates of the board and the possibilities array that it made our implementation slower than the serial version, so we decided not to include these steps. We parallelized elimination over cells, and looked for lone rangers in each row, column, and box in parallel using OpenMP’s parallel for loops. We ran our parallel code on the Xeon Phi. We carefully structured our code so that updates to the board would be made in an order that wouldn’t cause race conditions between threads. This helped us improve our speedup because we didn’t have any synch time.
To convert a sudoku puzzle to a sat formula, for each of the 81 grids in the 9x9 grid, we assign a variable s_x,y where x is the row number, and y is the column number. We then create 9 version of this variable, called s_x,y,z where the z value, ranging from 1 to 9, denotes whether the final grid value was the value z. The formal constraints are listed below:
From the above specifications, we can see that any SAT formula corresponding to a standard sudoku grid will have 9x9x9 = 729 different literals, and has 11,988 clauses, apart from the unit (one literal) clauses representing the pre-assigned entries. Of these clauses, 324 clauses are nine-ary and the remaining 11,664 clauses are binary.
Because of the sheer number of binary clauses, we deemed that used a sat solver algorithm that uses unit propagation (the ability to determine which clauses have only one undecided literal and all other literals are decided false) is important. The DPLL has two optimizations past the simple brute force sat solver algorithm, which is the pure literal rule, and unit propagation. The pure literal rule simply looks to see if any variables appear in only one form (positive or negated) throughout the entire formula, and sets those variables as True in the form they appear.
The simple brute force sat solver algorithm can be thought or as a depth first search through the decision space of all possible literal assignments, where the next unassigned literal is first assigned True, and if a conflict is reached, then the algorithm backtracks and assigned the same literal to be False.
In the simple sequential implementation, we have the benefit of the early detection of a conflict in the formula which tells us not to try any of the assignments with the same prefix as our conflict detected partial assignment. In our parallel implementation, we didn’t want to lose the ability. Therefore, we needed a central queue of work/partial assignments that we know wouldn’t be overlapping with the work of the other worker threads. We had a master thread that held a queue of partial assignments, and a set of available worker threads. There are three message types passed from the worker to master threads:
The master thread would stay in a while loop that listened to the workers until either a solution found message was received from a worker (to which the master thread would then broadcast for all workers to stop) or else, the job queue of partial assignments is empty, and all the worker threads are in the set of available workers. We initialize the program by putting the entirely empty formula into the job queue, and all worker threads in the set of available workers.
If worker threads are not actively processing their partial assignment, then worker threads would listen for a message from the master thread. In the case that their current partial assignment has reached a branching point, the worker thread would take the true version of the branch, and send the false version of the branch as a partial assignment to the master thread.
We measured performance in terms of speedup. To test our performance, we used a difficult 9x9 puzzle and ran it through the sequential solver first. This gave us a baseline. Next, we solved the same puzzle using our parallel solver with n = 1, 2, 4, 8, and 16 threads. While our code is written to handle boards of any size, we stuck with 9x9 since the sequential version took several minutes to solve a 16x16 board.
Since the sequential brute force solver is already quite fast, it was difficult to achieve significant speedup. In addition, there was probably overhead in spawning and scheduling threads. We achieved the greatest speedup between 1 and 2 threads, probably because the board was so small that the more threads we added, the more overhead there was, outweighing the benefits of parallelism. A third factor that could have limited our speedup is that there is a lot of repeated work in finding valid values for each cell, and it was difficult to avoid duplicate computations without excessive locking.
Since our Sudoku board was small enough to fit in the cache, we believe we were not bandwidth bound. Our problem didn’t really require processing data so we weren’t compute bound either.
Due to the sheer number of variables and clauses in our transformation from sudoku to sat formula and the rather weak implementation of our sat solver (in comparison to sat of the art solvers), we knew that using the SAT-Solver would not be the fastest method of sudoku solving implemented in this assignment. The runtime of the SAT-Solver on sudoku formulas ranged between 2-5 minutes for parallel and sequential. For interest, we also timed the total time from reading in a sudoku formula to converting to a sat instance to interpreting the solved sat formula to output a completed sudoku puzzle using the ubuntu package minisat
We can see that there is a significant different in speedup between the sequential and the parallel version using 4 processes. However, from 4 to 8 processes, the difference is notably smaller. The bottleneck for the speedup is probably due to overhead of message passing and the lack of branching for sudoku based sat formulas in particular. We saw that of the 11,988 clauses, 11,664 clauses are binary, meaning that there is a significant amount of unit propagation occurring as sequential computation inside a single process. We see that minisat package had a much smaller runtime than both our sequential and parallel implementations, but this is due more to intelligent sat solving optimizations than taking advantage of parallelism inside the problem.
We divided the work 50-50.