JSolve - yet another really fast brute force Sudoku solver

JasonLion

I have been exploring the ideas in Brian Turner's BB Sudoku, and have found numerous small improvements, which collectively add up to a substantial speed improvement. Along the way I rewrote everything from scratch, designed for portability, and tried to write readable and well-documented code. The result is the fastest brute force solver that I am aware of.

You can download the source at http://www.enjoysudoku.com/JSolve11.zip. This version has substantial improvements compared to the 1.0 version, which I mentioned briefly in another thread.

Timing was done on MacOS with a 2.66 GHz Core 2 Duo. I compiled all three programs with gcc -O3 in both 32 and 64 bit modes. fsss is version 8.1 and always used the '*' command line option. BB Sudoku is version 1.0 and always used the -D0 -S2 command line options. Times shown are from the user CPU time of the 'time' command.

briturner · Posted: Mon Jan 18, 2010 5:52 pm Post subject:

Good job. I will need to look through the tweaks, see what you did.

I did try 64 bit on my Windows computer, but for me, 64 bit was actually slower than 32 bit. This is probably due to the compiler / operating system.

keep up the good work.
Brit

dobrichev · Posted: Thu Jan 21, 2010 9:19 pm Post subject:

Really good results!

There is a new version of fsss at http://sites.google.com/site/dobrichev/fsss/fsss_8.2.zip with some improvement in initialization (~7%) and with time measurement changed to count I/O for compatible results.

I did some testing on Core 2 Duo processors and found that the time proportions shift: both the BB and JS1.0 code run better (but still a bit slower) on these processors than on my Pentium D and AMD Sempron machines.

IMO there is no advantage from 64 bit architecture for these (BB, JS, and FSSS) algorithms. Probably the origin of the better performance in 64 bit mode is somewhere in:
- compiler specifics;
- slow 32 bit emulation layer;
- usage of char and short data types (In FSSS changing "typedef unsigned short bitmap" to "unsigned int" may help).
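
To make the last point concrete, here is a minimal sketch of a candidate mask whose width is controlled by a single typedef. Only the typedef name "bitmap" comes from the post; ALL_DIGITS, clear_digit and count_candidates are hypothetical names made up for illustration and are not taken from the fsss source. The logic only needs nine bits, so widening the type changes how the compiler loads and stores the value, not what the code computes.

```c
#include <assert.h>

/* Width under test: the post suggests trying unsigned int here
   instead of the original "typedef unsigned short bitmap". */
typedef unsigned int bitmap;

/* Hypothetical constant: the nine low bits, one per digit 1..9. */
enum { ALL_DIGITS = 0x1FF };

/* Remove one digit (1..9) from a cell's candidate mask. */
bitmap clear_digit(bitmap candidates, int digit)
{
    assert(digit >= 1 && digit <= 9);
    return candidates & ~((bitmap)1 << (digit - 1)) & ALL_DIGITS;
}

/* Count how many candidates remain in a mask. */
int count_candidates(bitmap candidates)
{
    int n = 0;
    while (candidates) {
        candidates &= candidates - 1;   /* drop the lowest set bit */
        n++;
    }
    return n;
}
```

Whether the wider type actually helps would have to be measured per compiler and platform; the sketch only shows that the switch is a one-line change.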

When I have time I will take a look at your new code and do a similar comparison on my platform.

BTW, adding some links in this thread to datasets used would simplify the testing process.

JasonLion · Posted: Thu Jan 21, 2010 10:26 pm Post subject:

You can get top1465 from here;
sudoku17 comes from here;
multipuzz is from this post, but doesn't appear to be available any more;
get top50000 here;
Tarek Pearly 6000 is from here, the complete download doesn't seem to be available but the puzzles can be extracted from the posts;
GenPuzzles 500K comes from here.

On x86 family processors, 64 bit mode has 14 general purpose registers, while 32-bit mode only has 6. If the compiler is able to take full advantage of the extra registers, it ought to make a noticeable difference. I have started doing some tuning in 32 bit mode and have found that I don't want to unroll as many loops in 32 bit as I do in 64 bit. That makes sense to me given the difference in the number of registers.

The largest change I made from BB Sudoku is switching the current board into its own global variable, instead of using the top of the guess stack. That eliminates innumerable subscripts of the guess stack.
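
A rough sketch of the difference, under assumptions: the Board layout and every name below (guess_stack, guess_depth, board, push_guess, pop_guess) are hypothetical and are not copied from JSolve or BB Sudoku. The point is only that the subscripted form has to index the stack on every access, while the global form pays for a copy when a guess is pushed or popped.

```c
#include <assert.h>
#include <string.h>

enum { CELLS = 81, MAX_DEPTH = 81 };

/* Hypothetical board layout: one candidate bitmask per cell. */
typedef struct {
    unsigned short candidates[CELLS];
} Board;

Board guess_stack[MAX_DEPTH];   /* boards saved at each guess */
int   guess_depth = 0;
Board board;                    /* the current board, as a plain global */

/* Subscripted style: every access goes through guess_stack[guess_depth]. */
void clear_via_stack(int cell, unsigned short mask)
{
    guess_stack[guess_depth].candidates[cell] &= (unsigned short)~mask;
}

/* Global style: the address of the board is a link-time constant. */
void clear_via_global(int cell, unsigned short mask)
{
    board.candidates[cell] &= (unsigned short)~mask;
}

/* The price of the global style: copy the board on each guess... */
void push_guess(void)
{
    assert(guess_depth < MAX_DEPTH);
    memcpy(&guess_stack[guess_depth++], &board, sizeof board);
}

/* ...and copy it back when the guess fails. */
void pop_guess(void)
{
    assert(guess_depth > 0);
    memcpy(&board, &guess_stack[--guess_depth], sizeof board);
}
```

Typical use would be push_guess() before trying a digit and pop_guess() when that branch runs into a contradiction.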

The second largest difference in 64 bit mode has been the elimination of the house modified flags used by the hidden singles routine. Recently, in 32 bit mode, I found that putting the house modified flags back in helped things a little.

The other large change is my optimization of clearing possibilities on the neighboring cells when a digit is placed. BB Sudoku has two modes. One simply checks each of the 20 neighboring cells for each cell set. The other tracks which houses have had digits set in them and recalculates each of those houses.

I changed the conditions under which the second mode gets used: it now kicks in any time four or more cells are being set. I also changed the entire approach used in the second mode so that every cell gets checked once after all are set. Checking 81 cells is faster than checking as few as 10 houses (90 cells). Four cells will usually give you 10 or more houses modified. In the old approach, as many as 27 houses * 9 cells each, or a total of 243 cells, got checked. The average number of houses modified in practice is much closer to 27 than it is to 10, so this can be a big win.
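
Here is a simplified sketch of the two modes and the four-cell threshold. It is an illustration under assumptions, not the JSolve code: all names are hypothetical, the per-cell mode finds neighbors by scanning instead of using a precomputed 20-entry neighbor table, and the sweep first gathers the digits used in each house, which is only one possible way to get "every cell checked once".

```c
#include <assert.h>

enum { CELLS = 81, HOUSES = 9, BATCH_THRESHOLD = 4 };

unsigned short cand[CELLS];     /* candidate bitmask per cell */
unsigned char  placed[CELLS];   /* digit 1..9 once set, else 0 */

int row_of(int c) { return c / 9; }
int col_of(int c) { return c % 9; }
int box_of(int c) { return (c / 27) * 3 + (c % 9) / 3; }

void init_board(void)
{
    for (int c = 0; c < CELLS; c++) { cand[c] = 0x1FF; placed[c] = 0; }
}

/* Mode 1: for each newly set cell, clear its digit from its neighbors. */
void clear_per_cell(const int *cells, int n)
{
    for (int i = 0; i < n; i++) {
        int c = cells[i];
        unsigned short bit = (unsigned short)(1u << (placed[c] - 1));
        for (int p = 0; p < CELLS; p++) {
            int shares_house = row_of(p) == row_of(c) ||
                               col_of(p) == col_of(c) ||
                               box_of(p) == box_of(c);
            if (p != c && !placed[p] && shares_house)
                cand[p] &= (unsigned short)~bit;
        }
    }
}

/* Mode 2: one sweep over the whole board after all cells are set. */
void clear_by_sweep(void)
{
    unsigned short row_used[HOUSES] = {0};
    unsigned short col_used[HOUSES] = {0};
    unsigned short box_used[HOUSES] = {0};

    for (int c = 0; c < CELLS; c++) {
        if (placed[c]) {
            unsigned short bit = (unsigned short)(1u << (placed[c] - 1));
            row_used[row_of(c)] |= bit;
            col_used[col_of(c)] |= bit;
            box_used[box_of(c)] |= bit;
        }
    }
    for (int c = 0; c < CELLS; c++) {
        if (!placed[c])
            cand[c] &= (unsigned short)~(row_used[row_of(c)] |
                                         col_used[col_of(c)] |
                                         box_used[box_of(c)]);
    }
}

/* Place n digits, then pick the cheaper clearing mode. */
void set_cells(const int *cells, const unsigned char *digits, int n)
{
    for (int i = 0; i < n; i++) {
        assert(digits[i] >= 1 && digits[i] <= 9);
        placed[cells[i]] = digits[i];
        cand[cells[i]] = 0;
    }
    if (n >= BATCH_THRESHOLD)
        clear_by_sweep();
    else
        clear_per_cell(cells, n);
}
```

With a real neighbor table the per-cell mode costs about 20 checks per cell set, so four cells come to roughly 80 checks, which is about where a single pass over all 81 cells breaks even.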

dobrichev · Posted: Fri Jan 22, 2010 1:08 am Post subject:

dobrichev · Posted: Fri Jan 22, 2010 4:21 am Post subject:

JasonLion · Posted: Fri Jan 22, 2010 2:28 pm Post subject:

The queue has advantages and disadvantages. There is some overhead in maintaining the queue. The advantage only comes when four or more items are queued at times other than at the start (which presumably wasn't part of your testing). Removing the queue helps for some puzzles and hurts for others. For one of the test files I use, removing the queue helped a lot. For all of the others, removing the queue hurt a little.

There isn't any obvious way to decide what mix of puzzles to use for timing. Different choices in puzzle selection lead to different approaches in the code being optimal. I decided that removing the queue made the worst case performance worse by just enough that I was willing to give up the improvement in the best case performance, but that decision is essentially arbitrary.

Having a single global variable holding the board does seem to speed things up more than enough to pay for the extra memcpy. The difference in the assembly code for a single access is tiny, but it happens so often that it ends up having a noticeable effect. I was comparing to the subscripted version when I did that. The global pointer approach used in fsss is much closer to the speed of a single global, so the tradeoff might be different.

Creating the solution grid at the end is constrained to a small bit of code that only runs once per puzzle. gcc doesn't do cross source file inlining, so the code is still there in my testing, though the only part that runs is a single conditional which decides not to return the board. Having it return the board added between 0.1% and 1%, depending on the puzzle set.

In my testing against fsss, I always use the '*' command line option, which turns off copying the result back in fsss as well. That case, search for zero, one, or more than one solution and don't copy back the solution, is the crucial case in nearly all common usage.

JasonLion · Posted: Fri Jan 22, 2010 4:31 pm Post subject:

I have an update to JSolve, version 1.2. You can download the source at http://www.enjoysudoku.com/JSolve12.zip. This version has some very minor improvements and several options that can be turned on and off to optimize for specific platforms/compilers. It also includes preliminary settings of those options for MacOS 32 bit, MacOS 64 bit, and Windows 32 bit.

I updated the timings for JSolve 1.2 and fsss V8.2. In addition to the tiny improvement in JSolve 64-bit and noticeable improvement in the JSolve 32-bit, there is a huge improvement for fsss on the GenPuzzle 500K file.

dobrichev · Posted: Fri Jan 22, 2010 10:46 pm Post subject:

Here are the measurements I just did, combined with JasonLion's latest measurements.

JasonLion · Posted: Sat Jan 23, 2010 3:50 am Post subject:

Thank you dobrichev for posting your numbers. Your wonderful table really brings home to me just how different the GenPuzzle 500K test set is. It is the only one I test with that contains invalid puzzles. It is also the one that fsss V8.2 showed the most improvement vs fsss V8.1. It is also the one with the most divergent JL/Dob ratio numbers. It also has, by far, the highest number of puzzles solved/failed per second (Tarek Pearly 6000 has the fewest).

GenPuzzle 500K is a log of 500K sequential puzzles passed to the solver by my random game generator. My puzzle generator builds puzzles up from nothing, adding a couple of clues at a time and testing to see if the puzzle still has at least one solution. The great majority, 486,451 out of 500,000, of the puzzles can not be solved.

My generator already rejects puzzles that have obvious contradictions (two of the same digit in a single house) before calling the solver. Presumably, most of those puzzles are invalidated by the first few applications of hidden singles and locked candidates. That puts most of the processing time into the process queue/set digit routine.
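
The pre-check described above (two of the same digit in a single house) can be sketched as follows. This is a guess at the general shape, not the generator's actual code: the function name clues_consistent and the 81-character string format (with '1'..'9' for clues and any other character for a blank) are assumptions.

```c
#include <assert.h>

/* Returns 1 if no row, column or box holds the same clue digit twice,
   0 otherwise. Also returns 0 if the string is shorter than 81 cells. */
int clues_consistent(const char *puzzle)
{
    unsigned short row_seen[9] = {0};
    unsigned short col_seen[9] = {0};
    unsigned short box_seen[9] = {0};

    assert(puzzle != 0);
    for (int c = 0; c < 81; c++) {
        char ch = puzzle[c];
        if (ch == '\0')
            return 0;                       /* too short to be a grid */
        if (ch < '1' || ch > '9')
            continue;                       /* blank cell */

        unsigned short bit = (unsigned short)(1u << (ch - '1'));
        int r = c / 9, k = c % 9, b = (c / 27) * 3 + k / 3;

        if ((row_seen[r] | col_seen[k] | box_seen[b]) & bit)
            return 0;                       /* digit repeated in a house */
        row_seen[r] |= bit;
        col_seen[k] |= bit;
        box_seen[b] |= bit;
    }
    return 1;
}
```

A check like this only catches duplicate clues; puzzles that are unsolvable for subtler reasons still reach the solver.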

This observation does not lead anywhere that I can figure out yet, but it feels like it ought to eventually.

dobrichev · Posted: Sat Jan 23, 2010 7:28 am Post subject:

GenPuzzle 500K has the highest I/O-to-crunching ratio, and it seems that you are counting only the user time (ru_utime), not the user + kernel time (ru_utime + ru_stime).
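
For reference, both fields come from the POSIX getrusage() call. A small sketch of reading them separately so that either figure can be reported (the helper names tv_seconds and cpu_times are hypothetical):

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds. */
double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Store the process's user and kernel CPU time, in seconds.
   Returns 0 on success, -1 if getrusage() fails. */
int cpu_times(double *user_s, double *kernel_s)
{
    struct rusage ru;

    assert(user_s != 0 && kernel_s != 0);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user_s   = tv_seconds(ru.ru_utime);   /* time spent in user mode */
    *kernel_s = tv_seconds(ru.ru_stime);   /* time spent in the kernel, e.g. on I/O */
    return 0;
}
```

Reporting user + kernel is the sum of the two values; for an I/O-heavy file the kernel part can be a large share of the total.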

dobrichev · Posted: Wed Jan 27, 2010 10:54 pm Post subject: fsss v8.3 is ready

Hello again,

fsss_v8.3 is ready. All published releases, with a few comments, are here.

Timings:

JasonLion · Posted: Thu Jan 28, 2010 12:08 am Post subject:

My testing of fsss_v8.3 under MacOS shows drastically worse times than fsss_v8.2 in both 32 and 64 bit modes, and Tarek Pearly 6000 never finished at all in either mode. It would appear that using setjmp/longjmp is not a good approach on MacOS.

JasonLion · Posted: Thu Jan 28, 2010 3:57 pm Post subject:

I replaced setjmp/longjmp with _setjmp/_longjmp in fsss_v8.3 and it went from truly dismal to only somewhat slower than fsss_v8.2, but it still never completes Tarek Pearly 6000 (I gave up after 3 hours).
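
For anyone unfamiliar with the pattern, here is a toy sketch of one common use of setjmp/longjmp in a solver: jumping out of a deep recursive search in one step once enough solutions have been found. I have not checked that fsss_v8.3 uses it this way, so treat the structure and every name here (bail_out, search, count_up_to) as assumptions. On MacOS and other BSD-derived systems, plain setjmp also saves the signal mask, while _setjmp does not, which is the likely reason for the speed difference above.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf bail_out;        /* where to land when the search gives up */
static int solutions_found;

/* Toy search: a binary tree 'depth' levels deep in which every leaf
   counts as one solution. Jumps out as soon as 'limit' is reached. */
void search(int depth, int limit)
{
    if (depth == 0) {
        if (++solutions_found >= limit)
            longjmp(bail_out, 1);           /* unwind every frame at once */
        return;
    }
    search(depth - 1, limit);
    search(depth - 1, limit);
}

/* Count solutions, stopping early at 'limit' (must be >= 1). */
int count_up_to(int depth, int limit)
{
    assert(limit >= 1);
    solutions_found = 0;
    if (setjmp(bail_out) == 0)              /* 0 on the direct call */
        search(depth, limit);
    /* Reached either by a normal return or via longjmp. */
    return solutions_found;
}
```

Calling count_up_to(depth, 2) is the usual "zero, one, or more than one solution" test; a portable alternative to _setjmp is sigsetjmp(env, 0) with siglongjmp, which also skips the signal mask.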

gsf · Posted: Thu Jan 28, 2010 5:36 pm Post subject: