View previous topic :: View next topic |
Author |
Message |
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Sun Jan 17, 2010 6:40 pm Post subject: JSolve - yet another really fast brute force Sudoku solver |
|
|
I have been exploring the ideas in Brian Turner's BB Sudoku, and have found numerous small improvements, which collectively add up to a substantial speed improvement. Along the way I rewrote everything from scratch, designed for portability, and tried to write readable and well documented code. The result is the fasted brute force solver that I am aware of.
You can download the source at http://www.enjoysudoku.com/JSolve11.zip. This version has substantial improvements compared to the 1.0 version, which I mentioned briefly in another thread.
Timing was done on MacOS with a 2.66 GHz Core 2 Duo. I compiled all three programs with gcc -O3 in both 32 and 64 bit modes. fsss is version 8.1 and always used the '*' command line option. BB Sudoku is version 1.0 and always used the -D0 -S2 command line options. Times shown are from the user CPU time of the 'time' command.
Code: |
64-bit 32-bit
Twenty copies of top1465
JSolve V1.1 1.203 1.730
BB Sudoku V1.0 1.563 2.018
fsss V8.1 1.708 2.146
A version of sudoku17 with 48826 puzzles
JSolve V1.1 0.526 0.786
fsss V8.1 0.604 0.743
BB Sudoku V1.0 0.727 0.924
multipuzz
JSolve V1.1 7.896 11.860
fsss V8.1 9.773 12.383
BB Sudoku V1.0 12.571 16.045
top50000
JSolve V1.1 0.719 1.003
fsss V8.1 0.947 1.145
BB Sudoku V1.0 1.115 1.447
Tarek Pearly 6000
JSolve V1.1 0.949 1.343
BB Sudoku V1.0 1.371 1.769
fsss V8.1 1.405 1.759
GenPuzzle 500K
JSovle V1.1 0.865 1.097
BB Sudoku V1.0 1.172 1.501
fsss V8.1 2.254 2.558
|
It is interesting to note that the relative performance of the different programs is different for the 32 and 64 bit versions. I don't show it here, but the relative performance of the different programs also varies with the compiler used. fsss does better on Windows, presumably because the author spent time optimizing it for the Microsoft and Intel compilers on Windows, while I optimized for gcc running under MacOS. |
|
Back to top |
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Mon Jan 18, 2010 5:52 pm Post subject: |
|
|
Good job. I will need to look through the tweaks, see what you did.
I did try 64 bit on my Windows computer, but for me, 64 bit wa actually slower that 32 bit. This is probably due to the compiler / operating system.
keep up the good work.
Brit |
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Thu Jan 21, 2010 9:19 pm Post subject: |
|
|
Really good results!
There is next version of fsss at http://sites.google.com/site/dobrichev/fsss/fsss_8.2.zip with some improvement in initialization (~7%) and time measurement changed to count I/O for compatible results.
I made some testing on Core 2 Duo processors and found the time proportions are changed in direction that both BB and JS1.0 code runs better (but still a bit slower) on these processors then on my Pentium D and AMD Sempron machines.
IMO there is no advantage from 64 bit architecture for these (BB, JS, and FSSS) algorithms. Probably the origin of the better performance in 64 bit mode is somewhere in:
- compiler specifics;
- slow 32 bit emulation layer;
- usage of char and short data types (In FSSS changing "typedef unsigned short bitmap" to "unsigned int" may help).
When I have time I will take a look at your new code and do similar comparison on my platform.
BTW, adding some links in this thread to datasets used would simplify the testing process. |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Thu Jan 21, 2010 10:26 pm Post subject: |
|
|
You can get top1465 from here;
sudoku17 comes from here;
multipuzz is from this post, but doesn't appear to be available any more;
get top50000 here;
Tarek Pearly 6000 is from here, the complete download doesn't seem to be available but the puzzles can be extracted from the posts;
GenPuzzles 500K comes from here.
On x86 family processors, 64 bit mode has 14 general purpose registers, while 32-bit mode only has 6. If the compiler is able to take full advantage of the extra registers, it ought to make a noticeable difference. I have started doing some tuning in 32 bit mode and have found that I don't want to unroll as many loops in 32 bit as I do in 64 bit. That makes sense to me given the difference in the number of registers.
The largest change I made from BB Sudoku is switching the current board into it's own global variable, instead of using the top of the guess stack. That eliminates innumerable subscripts of the guess stack.
The second largest difference in 64 bit mode has been the elimination of the house modified flags used by the hidden singles routine. Recently, in 32 bit mode, I found that putting the house modified flags back in helped things a little.
The other large change is my optimization of clearing possibilities on the neighboring cells when a digit is placed. BB Sudoku has two modes. One simply checks each of the 20 neighboring cells for each cell set. The other tracks which houses have had digits set in them and recalculates each of those houses.
I changed the conditions under which the second mode gets used, now used any time there are four or more cells getting set. I also changed the entire approach used in the second mode so that every cell gets checked once after all are set. Checking 81 cells is faster than checking as few as 10 houses (90 cells). Four cells will usually give you 10 or more houses modified. In the old approach, as many as 27 houses * 9 cells each, or a total of 243 cells, got checked. The average number of houses modified in practice is much closer to 27 than it is to 10, so this can be a big win. |
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Fri Jan 22, 2010 1:08 am Post subject: |
|
|
JasonLion wrote: | On x86 family processors, 64 bit mode has 14 general purpose registers, while 32-bit mode only has 6. |
Agree.
This may cause some of the other interesting decisions in code to gain more normal (intuitive) form. I'll post after rewieving the code. |
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Fri Jan 22, 2010 4:21 am Post subject: |
|
|
JasonLion wrote: | The largest change I made from BB Sudoku is switching the current board into it's own global variable, instead of using the top of the guess stack. That eliminates innumerable subscripts of the guess stack. |
Using indexes is not a disaster. You may try using a pointer to stack instead of memcpy-ing (which as side effect can mess the optimization) or just try with index like in the original BB code. In FSSS I use the thread stack for whole context (no global variables) and guess context as local variable in the guessing procedure. The benefit is one memcpy when initializing the guess context. The price is the several returns exiting the recursion.
JasonLion wrote: | I changed the conditions under which the second mode gets used, now used any time there are four or more cells getting set. |
That should work fine. In FSSS I dropped the queue years ago measuring it as inefficient. In the latest version I started processing only the initial clues at once resulting in 7% faster code.
I found you are storing the solution in a "raw" form, and converting to string at the end, avoiding unnecessary conversion during the unsuccessful guesses. Also I found you made most of the changes I once suggested Brian about BB 1.0 code in other thread.
I have one remark concerning the calling procedure (in JSolveMain.c). You are calling JSolve once with constant parameter 0 for the result buffer this telling the optimizer to remove all the code relevant to the solution string. Most of the compilers do it. You may cheat it with code like if(argc!=1000) rb=0; else rb=(char*)&something;.
Finally, in the main loop
Code: | while (stack_depth>=0) { // Keep going till total failure
if (queue_depth>0) ProcessQueue(); // Process everything on the queue
if (cells_remaining>0 && !invalid_board) { // if not finished
HiddenSingles(); // try hidden singles
if (queue_depth<=0 && !invalid_board) { // if that didn't do anything, try locked candidates
LockedCandidates();
// if that didn't do anything, guess
if (queue_depth<=0 && !invalid_board) Guess();
}
}
...
| you are going to guess even if LockedCandidates chaecking eliminated some candidates but doesn't find a digit. There is scenario a Hidden Single to occur after this elimination. I made experiments and found that in 32-bit compilations repeating the loop in such cases works slower than guessing. I didn't try more complex logic to check for Hidden Singles and if not found to go to guessing w/o secondary check for Locked Candidates. This is a room for improvement.
In FSSS code I changed the two 4*3 loops to one 12 loop + lookup. The benefit is questionable and if exists it is less than 3-4%.
Actually the only two major differences between JS and FSSS are 1) queue, and 2) guessing mechanism and respective data representation. I am glad to see that we together with Brian are going in the same direction. |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Fri Jan 22, 2010 2:28 pm Post subject: |
|
|
The queue has advantages and disadvantages. There is some overhead in maintaining the queue. The advantage only comes when four or more items are queued at times other than at the start (which presumably wasn't part of your testing). Removing the queue helps for some puzzles and hurts for others. For one of the test files I use, removing the queue helped a lot. For all of the others, removing the queue hurt a little.
There isn't any obvious way to decide what mix of puzzles to use for timing. Different choices in puzzle selection lead to different approaches in the code being optimal. I decided that removing the queue made the worst case performance worse by just enough that I was willing to give up the improvement in the best case performance, but that decision is essentially arbitrary.
Having a single global variable holding the board does seem to speed things up more than enough to pay for the extra memcpy. The difference in the assembly code for a single access is tiny, but it happens so often that it ends up having a noticeable effect. I was comparing to the subscripted version when I did that. The global pointer approach used in fsss is much closer to the speed of a single global, so the tradeoff might be different.
Creating the solution grid at the end is constrained to a small bit of code that only runs once per puzzle. gcc doesn't do cross source file inlining, so the code is still there in my testing, though the only part that runs is a single conditional which decides not to return the board. Having it return the board added between 0.1% and 1%, depending on the puzzle set.
In my testing against fsss, I always use the '*' command line option, which turns off copying the result back in fsss as well. That case, search for zero, one, or more than one solution and don't copy back the solution, is the crucial case in nearly all common usage. |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Fri Jan 22, 2010 4:31 pm Post subject: |
|
|
I have an update to JSolve, version 1.2. You can download the source at http://www.enjoysudoku.com/JSolve12.zip. This version has some very minor improvements and several options that can be turned on and off to optimize for specific platforms/compilers. It also includes preliminary settings of those options for MacOS 32 bit, MacOS 64 bit, and Windows 32 bit.
I updated the timings for JSolve 1.2 and fsss V8.2. In addition to the tiny improvement in JSolve 64-bit and noticeable improvement in the JSolve 32-bit, there is a huge improvement for fsss on the GenPuzzle 500K file.
Code: |
64-bit 32-bit
Twenty copies of top1465 <http>
JSolve V1.2 1.192 1.610
BB Sudoku V1.0 1.563 2.018
fsss V8.2 1.661 2.091
A version of sudoku17 with 48826 puzzles <http>
JSolve V1.2 0.522 0.699
fsss V8.2 0.553 0.691
BB Sudoku V1.0 0.727 0.924
multipuzz <http>
JSolve V1.2 7.821 10.531
fsss V8.2 8.735 11.354
BB Sudoku V1.0 12.571 16.045
top50000 <http>
JSolve V1.2 0.715 0.931
fsss V8.2 0.842 1.145
BB Sudoku V1.0 1.115 1.447
Tarek Pearly 6000 <http>
JSolve V1.2 0.944 1.259
BB Sudoku V1.0 1.371 1.769
fsss V8.2 1.418 1.770
GenPuzzle 500K <http>
fsss V8.2 0.653 0.810
JSovle V1.2 0.846 1.069
BB Sudoku V1.0 1.172 1.501
|
|
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Fri Jan 22, 2010 10:46 pm Post subject: |
|
|
Here are the measurements I just did, combined with JasonLion's latest measurements. Code: | Dobrichev JasonLion JL/Dob
Seconds Ratio Seconds Ratio
Twenty copies of top1465
BB Sudoku V1.0 1.940 1.08 2.018 1.25 1.04
JSolve V1.2 1.796 1.00 1.610 1.00 0.90
fsss V8.2 2.375 1.32 2.091 1.30 0.88
A version of sudoku17 with 48826 puzzles
BB Sudoku V1.0 0.962 1.16 0.924 1.34 0.96
JSolve V1.2 0.828 1.00 0.699 1.01 0.84
fsss V8.2 0.828 1.00 0.691 1.00 0.83
top50000
BB Sudoku V1.0 1.344 1.19 1.447 1.55 1.08
JSolve V1.2 1.125 1.00 0.931 1.00 0.83
fsss V8.2 1.235 1.10 1.145 1.23 0.93
Tarek Pearly 6000
BB Sudoku V1.0 1.463 1.05 1.769 1.41 1.21
JSolve V1.2 1.390 1.00 1.259 1.00 0.91
fsss V8.2 1.985 1.43 1.770 1.41 0.89
GenPuzzle 500K
BB Sudoku V1.0 2.795 1.79 1.501 1.85 0.54
JSolve V1.2 1.968 1.26 1.069 1.32 0.54
fsss V8.2 1.563 1.00 0.810 1.00 0.52
| The Ratio column is the time normalized to the fastest solver.
The rightmost column is the JasonLion's time divided by Dobrichev's time. If the only difference between machines is the clock, this column should be constant. The dispersion of the values in this column is a generalized measure of the influence of the HW, Compiler, OS, and time measurement specifics to the final result.
The solvers are run in the displayed order in a batch, each solver for 4 times for each dataset. The minimal of the 4 measured times is recorded.
The puzzels are solved up to the second solution w/o output. The command lines are as follows:
BB -d0 -t1 -s2 <datafile>> logfile
JS infile >> logfile
FSSS infile * >> logfile
There is significant improvement in the JSolver v1.2 against v1.0.
Congratulations! |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Sat Jan 23, 2010 3:50 am Post subject: |
|
|
Thank you dobrichev for posting your numbers. Your wonderful table really brings home to me just how different the GenPuzzle 500K test set is. It is the only one I test with that contains invalid puzzles. It is also the one that fsss V8.2 showed the most improvement vs fsss V8.1. It is also the one with the most divergent JL/Dob ratio numbers. It also has, by far, the highest number of puzzles solved/failed per second (Tarek Pearly 6000 has the fewest).
GenPuzzle 500K is a log of 500K sequential puzzles passed to the solver by my random game generator. My puzzle generator builds puzzles up from nothing, adding a couple of clues at a time and testing to see if the puzzle still has at least one solution. The great majority, 486,451 out of 500,000, of the puzzles can not be solved.
My generator already rejects puzzles that have obvious contradictions (two of the same digit in a single house) before calling the solver. Presumably, most of those puzzles are invalidated by the first few applications of hidden singles and locked candidates. That puts most of the processing time into the process queue/set digit routine.
This observation does not lead anywhere that I can figure out yet, but it feels like it ought to eventually. |
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Sat Jan 23, 2010 7:28 am Post subject: |
|
|
GenPuzzle 500K has the highest IO / cruncking ratio and seems that you are counting the user time (ru_utime) but not the user + kernel time (ru_stime). |
|
Back to top |
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Wed Jan 27, 2010 10:54 pm Post subject: fsss v8.3 is ready |
|
|
Hello again,
fsss_v8.3 is ready. All published releases with few comments are here.
Timings: Code: |
Seconds Ratio
Twenty copies of top1465
BB Sudoku V1.0 1.940 1.19
JSolve V1.2 1.796 1.11
fsss V8.3 1.625 1.00
A version of sudoku17 with 48826 puzzles
BB Sudoku V1.0 0.962 1.31
JSolve V1.2 0.828 1.13
fsss V8.3 0.735 1.00
top50000
BB Sudoku V1.0 1.344 1.25
JSolve V1.2 1.125 1.04
fsss V8.3 1.078 1.00
Tarek Pearly 6000
BB Sudoku V1.0 1.463 1.05
JSolve V1.2 1.390 1.00
fsss V8.3 1.390 1.00
GenPuzzle 500K
BB Sudoku V1.0 2.795 1.81
JSolve V1.2 1.968 1.27
fsss V8.3 1.547 1.00
|
The most interesting change is in the innermost loop of FindLockedCandidates().
I tried calling fsss solver from JSolve v1.2 outer loop (JSolveMain.c) and found that it is working a bit slowly (let say 3-4%, I didn't stored the exact timings). So, to get more realistic data in the comparison above, JSolve's timing has to be scaled down in about 3-4%. Few percents more for BB, which has unusually complex input parsing.
I am expecting on 64-bit platform fsss will run relatively slower due to setjmp/longjmp logic implemented in v8.3 which is more complicated when more context has to be stored. |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Thu Jan 28, 2010 12:08 am Post subject: |
|
|
My testing of fsss_v8.3 under MacOS shows drastically worst times than fsss_v8.2 in both 32 and 64 bit modes, and Tarek Pearly 6000 never finished at all in either mode. It would appear that using setjmp/longjmp is not a good approach on MacOS. |
|
Back to top |
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Thu Jan 28, 2010 3:57 pm Post subject: |
|
|
I replaced setjmp/longjmp with _setjmp/_longjmp in fsss_v8.3 and it when from truly dismal to only somewhat slower than fsss_v8.2, but it still never completes Tarek Pearly 6000 (I gave up after 3 hours). |
|
Back to top |
|
|
| gsf
| Joined: 18 Aug 2005 | Posts: 408 | : | Location: NJ USA | Items |
|
Posted: Thu Jan 28, 2010 5:36 pm Post subject: |
|
|
JasonLion wrote: | I replaced setjmp/longjmp with _setjmp/_longjmp in fsss_v8.3 and it when from truly dismal to only somewhat slower than fsss_v8.2, but it still never completes Tarek Pearly 6000 (I gave up after 3 hours). |
setjmp/longjmp and/or _setjmp/_longjmp will most likely wreak havoc
on compile/link time sw optimizations as well as runtime hw register/memory caching
even if you do find one os/cpu combination that shows an improvement
the chances are slim that it will act similarly on other os/cpu |
|
Back to top |
|
|
|