| gsf
| Joined: 18 Aug 2005 | Posts: 408 | : | Location: NJ USA | Items |
|
Posted: Fri Oct 23, 2009 2:44 am Post subject: |
|
|
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[*(ucp++)] = 1; ChangedGroup[*(ucp++)] = 1; ChangedGroup[*ucp] = Changed = 1; }
|
JasonLion wrote: | The order of execution of post-increments relative to the expression they are in is explicitly not defined and the compiler is allowed to do it in a number of different ways. The old Borland C compiler would delay all post increments until the very end of the expression (after the assignments).
|
the ';' forces the compiler to (effectively) complete all computations before proceeding to the next statement
so the macro is ok on conforming compilers
however, I think the optimizer for the original macro is not doing a good job
Code: | a[b][0] = a[b][1] = a[b][2] = 1 |
should not have to re-evaluate a[b] for each part
and the constant suffixes should be another optimizer win
there are plenty of examples across architectures where switching from arrays to pointers is a loss
moving 64 bit pointers vs small const indices might also slow down an inner loop |
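for reference, a minimal sketch (hypothetical macro names, not the solver's code) of the two shapes under discussion:
Code: | /* sketch only -- MARK_SAFE/MARK_UNDEFINED are hypothetical names.
   in MARK_SAFE each ';' is a sequence point, so every ucp++ completes
   before the next statement runs; in MARK_UNDEFINED the pointer is
   modified three times with no intervening sequence point, which the
   C standard leaves undefined */
#define MARK_SAFE(dst, chg, ucp) do {              \
    (dst)[*(ucp)++] = 1;     /* sequence point */  \
    (dst)[*(ucp)++] = 1;     /* sequence point */  \
    (dst)[*(ucp)]   = (chg) = 1;                   \
} while (0)

#define MARK_UNDEFINED(dst, chg, ucp)              \
    ((dst)[*(ucp)++] = (dst)[*(ucp)++] = (dst)[*(ucp)++] = (chg) = 1)
|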
|
|
|
| JasonLion
| Joined: 16 Nov 2008 | Posts: 61 | : | Location: Silver Spring, MD | Items |
|
Posted: Fri Oct 23, 2009 3:09 am Post subject: |
|
|
Just to be clear, I was commenting on the version without semi-colons between the assignments:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = Changed = 1; } |
I wonder how this one does on the various compilers:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[ucp[0]] = ChangedGroup[[ucp[1]] = ChangedGroup[ucp[2]] = Changed = 1; } |
|
|
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Fri Oct 23, 2009 3:13 am Post subject: |
|
|
Greetings,
I have tried a couple of ideas people have posted here. This is what I saw:
1) Using a pointer to replace all the G[PIdx] references. It appears that this should save the evaluation of G[PIdx] in each instance. My testing showed a small speed up (887 to 881 msec for the pearly 6000).
2) Changing the MarkChanged macro. Again, the idea is for the compiler to not recalculate the index. This actually caused my solver to slow considerably. I had to make each assignment a separate statement to make it work (rather than everything set to the final 1, I have 4 '= 1' statements). After that change, the timing was just slightly longer (a fraction of a percent). Using ucp[0]... was also in the noise.
Conclusion: These suggested changes can be helpful (speedups of 10% or more were seen by others), but they are very dependent on the compiler and how well it optimizes. For me, they provided at best less than a 1% improvement.
One final note, on the discussion of the portability of the ++ statements in a series of assignments: I found that the suggested macro did not work correctly for me (the stats showed that it doubled many of the steps, like the number of naked singles). Reworking the macro into separate assignments restored the portability.
Brit |
|
|
|
| PIsaacson
| Joined: 17 Jan 2006 | Posts: 47 | : | Location: Campbell, CA | Items |
|
Posted: Fri Oct 23, 2009 7:23 am Post subject: |
|
|
Using gcc 4.4.1 at optimization level 3 and objdump to disassemble:
#define MarkChanged(x) { ChangedGroup[C2Grp[x][0]] = ChangedGroup[C2Grp[x][1]] = ChangedGroup[C2Grp[x][2]] = Changed = 1; }
Code: | MarkChanged(t2); // Mark groups as changed
ad3: 8d 1c 5b lea (%ebx,%ebx,2),%ebx
ad6: c7 85 54 ff ff ff 01 movl $0x1,-0xac(%ebp)
add: 00 00 00
ae0: 0f b6 b3 40 00 00 00 movzbl 0x40(%ebx),%esi
ae7: 0f b6 bb 41 00 00 00 movzbl 0x41(%ebx),%edi
aee: 0f b6 9b 42 00 00 00 movzbl 0x42(%ebx),%ebx
af5: c6 83 f0 d0 00 00 01 movb $0x1,0xd0f0(%ebx)
afc: c6 87 f0 d0 00 00 01 movb $0x1,0xd0f0(%edi)
b03: c6 86 f0 d0 00 00 01 movb $0x1,0xd0f0(%esi) |
#define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[*(ucp++)] = 1; ChangedGroup[*(ucp++)] = 1; ChangedGroup[*ucp] = Changed = 1; }
Code: | MarkChanged(t2); // Mark groups as changed
5f7b: 8d 34 5b lea (%ebx,%ebx,2),%esi
5f7e: 8d 9e 40 00 00 00 lea 0x40(%esi),%ebx
5f84: 0f b6 b6 40 00 00 00 movzbl 0x40(%esi),%esi
5f8b: c6 86 f0 d0 00 00 01 movb $0x1,0xd0f0(%esi)
5f92: 0f b6 73 01 movzbl 0x1(%ebx),%esi
5f96: c6 86 f0 d0 00 00 01 movb $0x1,0xd0f0(%esi)
5f9d: 0f b6 5b 02 movzbl 0x2(%ebx),%ebx
5fa1: c7 85 50 ff ff ff 01 movl $0x1,-0xb0(%ebp)
5fa8: 00 00 00
5fab: c6 83 f0 d0 00 00 01 movb $0x1,0xd0f0(%ebx) |
I would say there is a slight performance advantage to the original (one fewer instruction), but the pointer macro uses fewer registers.
Using a sudoku grid generator program that makes numerous calls to Solver, the timings indicated that the original macro was somewhat faster.
Original: Code: | 41> time suexg 0 10000 > /dev/null
real 0m9.234s
user 0m0.015s
sys 0m0.015s
42> time suexg 0 10000 > /dev/null
real 0m9.235s
user 0m0.015s
sys 0m0.000s |
Modified: Code: | 45> time suexg 0 10000 > /dev/null
real 0m9.531s
user 0m0.015s
sys 0m0.015s
45> time suexg 0 10000 > /dev/null
real 0m9.532s
user 0m0.015s
sys 0m0.015s |
|
|
|
|
| daj95376
| Joined: 05 Feb 2006 | Posts: 349 | : | | Items |
|
Posted: Fri Oct 23, 2009 8:28 am Post subject: |
|
|
I put the following code through my Borland C++ 5.5.1 compiler:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; \
ChangedGroup[*(ucp++)] = \
ChangedGroup[*(ucp++)] = \
ChangedGroup[*(ucp++)] = Changed = 1; }
uint8 C2Grp[1][3] = {0,1,2};
int Changed, ChangedGroup[3];
ChangedGroup[0] = ChangedGroup[1] = ChangedGroup[2] = 3;
MarkChanged(0);
printf( "ChangeGroup: %d %d %d\n",
ChangedGroup[0], ChangedGroup[1], ChangedGroup[2] );
|
The output is as I (and Jason) expected:
When I changed the macro to:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; \
ChangedGroup[*(ucp++)] = 1; \
ChangedGroup[*(ucp++)] = 1; \
ChangedGroup[* ucp ] = Changed = 1; }
|
The output changed to:
The following macro also worked:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; \
ChangedGroup[ucp[0]] = \
ChangedGroup[ucp[1]] = \
ChangedGroup[ucp[2]] = Changed = 1; }
|
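For completeness, here is a self-contained version of that test (the #include, the uint8 typedef, and main are additions here, not from the solver), using the well-defined indexed macro; on a conforming compiler it prints "ChangeGroup: 1 1 1":
Code: | /* self-contained test harness -- the includes, typedef and main are
   added here; with the indexed (well-defined) macro the output on any
   conforming compiler is "ChangeGroup: 1 1 1" */
#include <stdio.h>

typedef unsigned char uint8;

uint8 C2Grp[1][3] = {{0, 1, 2}};
int Changed, ChangedGroup[3];

#define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; \
                        ChangedGroup[ucp[0]] = \
                        ChangedGroup[ucp[1]] = \
                        ChangedGroup[ucp[2]] = Changed = 1; }

int main (void)
{
    ChangedGroup[0] = ChangedGroup[1] = ChangedGroup[2] = 3;
    MarkChanged(0);
    printf( "ChangeGroup: %d %d %d\n",
            ChangedGroup[0], ChangedGroup[1], ChangedGroup[2] );
    return 0;
}
|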
FWIW: I believe operator precedence (assignment associates right to left) dictates that sudoking's altered macro, when compiled with a compiler other than mine, is actually equivalent to:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; \
ChangedGroup[ucp[2]] = \
ChangedGroup[ucp[1]] = \
ChangedGroup[ucp[0]] = Changed = 1; }
|
This doesn't make a difference in this instance, but it would if "=" was replaced by "+=" in the statement. |
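A small illustration of that last point (sketch only, hypothetical values, not solver code): with "=" every element ends up 1 either way, but with "+=" each result feeds the next assignment, so the direction of the chain matters:
Code: | /* sketch only, hypothetical values: shows why the direction matters
   with "+=" but not with "=" */
#include <stdio.h>

int main (void)
{
    int a[3] = {3, 3, 3};
    int b[3] = {3, 3, 3};

    a[0] += a[1] += a[2] += 1;              /* a[2]=4, then a[1]=7, then a[0]=10 */
    b[2] += b[1] += b[0] += 1;              /* b[0]=4, then b[1]=7, then b[2]=10 */

    printf("%d %d %d\n", a[0], a[1], a[2]); /* prints: 10 7 4 */
    printf("%d %d %d\n", b[0], b[1], b[2]); /* prints: 4 7 10 */
    return 0;
}
|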
|
|
|
| sudoking
| Joined: 20 Oct 2009 | Posts: 40 | : | | Items |
|
Posted: Fri Oct 23, 2009 8:48 am Post subject: |
|
|
JasonLion wrote: | The order of execution of post-increments relative to the expression they are in is explicitly not defined and the compiler is allowed to do it in a number of different ways. The old Borland C compiler would delay all post increments until the very end of the expression (after the assignments). |
I wasn't aware of that. Thanx. I'll keep that in mind in the future.
JasonLion wrote: | I wonder how this one does on the various compilers:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[ucp[0]] = ChangedGroup[[ucp[1]] = ChangedGroup[ucp[2]] = Changed = 1; } |
|
It would cause a compiler error.
But this:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[ucp[0]] = ChangedGroup[ucp[1]] = ChangedGroup[ucp[2]] = Changed = 1; } |
gave identical results to the original. That's on the MSVC 7.1 compiler.
briturner wrote: | Greetings,
I have tried a couple of ideas people have posted here. This is what I saw:
1) Using a pointer to replace all the G[PIdx] references. It appears that this should save the evaluation of G[PIdx] in each instance. My testing showed a small speed up (887 to 881 msec for the pearly 6000).
2) Changing the MarkChanged macro. Again, the idea is for the compiler to not recalculate the index. This actually caused my solver to slow considerably. I had to make each assignment a separate statement to make it work (rather than everything set to the final 1, I have 4 '= 1' statements). After that change, the timing was just slightly longer (a fraction of a percent). Using ucp[0]... was also in the noise.
Conclusion: These suggested changes can be helpful (speedups of 10% or more were seen by others), but they are very dependent on the compiler and how well it optimizes. For me, they provided at best less than a 1% improvement.
One final note, on the discussion of the portability of the ++ statements in a series of assignments: I found that the suggested macro did not work correctly for me (the stats showed that it doubled many of the steps, like the number of naked singles). Reworking the macro into separate assignments restored the portability.
Brit |
What compiler were you using? |
|
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Fri Oct 23, 2009 11:07 pm Post subject: |
|
|
I have recently upgraded to Win 7 RC and MS Visual Studio 2008. For final releases, I use profile-guided optimization, since it does make the build a couple percent faster. |
|
|
|
| sudoking
| Joined: 20 Oct 2009 | Posts: 40 | : | | Items |
|
Posted: Wed Oct 28, 2009 4:32 pm Post subject: |
|
|
Here's an interesting development. I checked the assembler that was compiled when I used the following:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = Changed = 1; } |
On the debug compile, with no optimisation, it worked as I had expected, and was an exact replacement for the original. However, on the release compile, with optimisations on, the increments were all performed at the end of the statement, which was the equivalent of:
Code: | #define MarkChanged(x) { ChangedGroup[C2Grp[x][0]] = Changed = 1; } |
The reason I hadn't noticed this before was that this code still produced the same results. In fact, I used this line to replace the original line, tested it against 20 million puzzles, and it gave the same results for each one. briturner, is it possible that setting the remaining 2 elements of ChangedGroup (i.e. those > 8) has no effect?
Also, a note on performance increases. I'm using a relatively slow processor with a reasonably fast hard drive, so any improvements in processing will be more noticeable. This may explain why I see higher performance increases than others. It could also be because I'm using an older compiler that doesn't produce code as well optimised as that of newer compilers. |
|
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Thu Oct 29, 2009 12:48 am Post subject: |
|
|
I would suspect the optimizations are not as good in the older compiler, so that one change happened to hit the right pattern for it. With an older compiler I found that I could get a couple percent difference just by including code that never ran; it was a hunt for the random unneeded change that would provide the best optimization. So I am trying to avoid these kinds of changes, which only try to tweak the optimizer, in favor of more legitimate changes, like changing the order of the guesses or reducing an assignment.
When I used the modified macro, the solutions were still found, but it took maybe 40% longer. Checking the stats showed that there were almost double the checks for naked and hidden singles (and thus the longer run time).
brit |
|
|
|
| ChPicard
| Joined: 12 Mar 2008 | Posts: 82 | : | Location: Montreal, Canada | Items |
|
Posted: Fri Oct 30, 2009 10:55 am Post subject: End the buffer in DecodePuzzleString |
|
|
Hi
Is it necessary to end the buffer each time the routine reads a new puzzle? I do it once in the main routine and it works.
Best regards
JPS |
|
|
|
| dmhayden
| Joined: 01 Aug 2007 | Posts: 16 | : | Location: Central New Jersey | Items |
|
Posted: Sat Oct 31, 2009 12:35 pm Post subject: |
|
|
sudoking wrote: | Here's an interesting development. I checked the assembler that was compiled when I used the following:
Code: | #define MarkChanged(x) {register unsigned char const *ucp = C2Grp[x]; ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = ChangedGroup[*(ucp++)] = Changed = 1; } |
On the debug compile, with no optimisation, it worked as I had expected, and was an exact replacement for the original. However, on the release compile, with optimisations on, the increments were all performed at the end of the statement, which was the equivalent of:
Code: | #define MarkChanged(x) { ChangedGroup[C2Grp[x][0]] = Changed = 1; } |
|
You got different results because the behavior of the code is not well defined. In your example, the compiler can perform the increments of ucp any time it wants, as long as it does so before the end of the statement, and since you are modifying and reading it several times in one expression, about the only thing you are guaranteed is that one of the "ucp++" expressions will evaluate to the original value. You don't even know which one. |
|
|
|
| lkSudoku
| Joined: 16 May 2009 | Posts: 60 | : | | Items |
|
Posted: Mon Nov 16, 2009 1:20 pm Post subject: |
|
|
This code is indeed fast |
|
|
|
| dobrichev
| Joined: 18 Dec 2009 | Posts: 28 | : | | Items |
|
Posted: Tue Jan 05, 2010 11:48 pm Post subject: |
|
|
Good Job!
There are really interesting pieces of code.
Let me suggest some improvements.
1) Data definition.
Using unsigned short instead of unsigned int for Grid, Grp, and SingleVal improves performance.
Code: |
...
unsigned short Grid[81]; // Possibilities left for each cell
unsigned short Grp[27]; // Values found in each group
...
unsigned short SingleVal[128];
|
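A note on why 16 bits are enough here (the bit layout below is an assumption on my part, based on the candidate-bitmask arithmetic quoted later in this post): only nine candidate bits are needed per entry, so an unsigned short halves the footprint of the hot arrays:
Code: | /* sketch only -- the bit assignment is assumed, not taken from the solver */
typedef unsigned short bits9;          /* nine candidate bits fit in 16       */
#define ALL_CANDS ((bits9)0x01FF)      /* values 1..9 all still possible      */
#define CAND(v)   ((bits9)(1u << ((v) - 1)))

bits9 Grid[81];                        /* 162 bytes instead of 324 with int   */
|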
2) Inlining.
These function definitions work better for me, along with compiling with the option not to automatically inline all "suitable" functions.
Code: |
int DecodePuzzleString (int ret_puz, char* buffer);
inline void InitGrid ();
//void ProcessInitSingles (void);
void ProcessSingles (void);
inline void FindHiddenSingles (void);
void FindLockedCandidates (void);
void FindSubsets (void);
void FindFishies (void);
void DoStep (int doing_2_step, int use_methods);
inline void MakeGuess (void);
|
3) The ProcessInitSingles function is slower than ProcessSingles.
Code: |
...
if (SingleCnt)
{ SCnt++;
//if (SingleCnt > 2) // If multiple singles
// ProcessInitSingles(); // process them all at once
//if (SingleCnt) // otherwise
ProcessSingles(); // process them one at a time
if (!G[PIdx].CellsLeft)
...
|
4) The ChangedGroup cache slows down the process.
Code: |
//#define MarkChanged(x) { ChangedGroup[C2Grp[x][2]] = ChangedGroup[C2Grp[x][1]] = ChangedGroup[C2Grp[x][0]] = Changed = 1; }
#define MarkChanged(x) { ; }
...
void FindHiddenSingles (void)
{ unsigned short t1, t2, t3, b, t;
int i, j;
Changed = 0;
for (i = 0; i < 27; i++)
//if (ChangedGroup[i])
{ ChangedLC[i] = 1;
...
//ChangedGroup[i] = 0;
}
//Changed = 0;
}
|
5) Improvements in the most critical code - ProcessSingles.
The group cache update is unconditional, but removing the possibility may raise an error and return before the whole loop completes. Moving (and unrolling) the group cache update to the bottom benefits from that eventual earlier return.
The internal loop unrolling is critical for performance. For compilers that do not support unrolling, defining the body of the loop as a macro and invoking it manually 20 times works fine (see the sketch below).
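A sketch of that manual-unrolling pattern, with hypothetical names (this is not the actual ProcessSingles body):
Code: | /* hypothetical illustration of the manual-unrolling pattern described
   in 5), not the real ProcessSingles code */
#define BODY(k)  do { if (queue[k]) { total += queue[k]; queue[k] = 0; } } while (0);

static int queue[20];
static int total;

static void drain_queue_unrolled (void)
{
    BODY(0)  BODY(1)  BODY(2)  BODY(3)  BODY(4)
    BODY(5)  BODY(6)  BODY(7)  BODY(8)  BODY(9)
    BODY(10) BODY(11) BODY(12) BODY(13) BODY(14)
    BODY(15) BODY(16) BODY(17) BODY(18) BODY(19)
}
|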
6) In the main loop, the more advanced algorithms are executed even if some procedure has already found a discrepancy at an earlier stage. Checking for No_Sol before calling each method solves this.
Code: |
if (Changed){
if (!No_Sol)
{ HCnt++; FindHiddenSingles(); if (SingleCnt) continue;}
if ((!No_Sol) && use_methods & USE_LOCK_CAND)
{ LCCnt++; FindLockedCandidates(); if (Changed) continue; }
if ((!No_Sol) && use_methods & USE_SUBSETS)
{ SSCnt++; FindSubsets(); if (Changed) continue; }
if ((!No_Sol) && use_methods & USE_FISHIES)
{ FCnt++; FindFishies(); if (Changed) continue; }
if ((!No_Sol) && use_methods & USE_1_STEP)
{ OneCnt++; DoStep(0, use_methods); if (Changed) continue; }
if ((!No_Sol) && use_methods & USE_2_STEP)
{ TwoCnt++; DoStep(1, use_methods); if (Changed) continue; }
}
|
7) random.h, along with its initialization, is unnecessary.
8) Printing statistics for the totals instead of for the latest puzzle is more informative.
Code: |
if (show_stats)
fprintf (outfile, "\n %9ld Singles Processed\n %9ld Hidden Singles\n %9ld Locked Candidates"
"\n %9ld SubSets Searched\n %9ld Fishies\n %9ld 1-Steps\n %9ld 2-Steps"
"\n %9ld Guesses\n %9d Max Depth\n",
//SCnt, HCnt, LCCnt, SSCnt, FCnt, OneCnt, TwoCnt, GCnt, MaxDepth);
TSCnt, THCnt, TLCCnt, TSSCnt, TFCnt, TOneCnt, TTwoCnt, TGCnt, MaxDepth);
|
9) In FindHiddenSingles I somehow decided to update the grid directly, without waiting for the queue to be processed. Now I am not sure about the benefits, but this probably increases the chances of finding a discrepancy earlier.
Code: |
void FindHiddenSingles (void)
{ unsigned short t1, t2, t3, b, t;
...
if (!B2V[t]) { No_Sol = 1; SingleCnt = 0; Changed = 0; return; }
G[PIdx].Grid[Group[i][j]] = t;
PushSingle((char)Group[i][j], t);
...
}
|
Thank you Brian for this fast and elegant code.
I used the impressive piece of code
Code: | t1 = t2 = t3 = 0;
for (j = 0; j < 9; j++)
{ b = G[PIdx].Grid[Group[i][j]];
t2 = t2 | (t1 & b);
t1 = t1 | b;
}
...
t3 = t2 ^ t1;
| in my solver. This, along with copying the whole of FindLockedCandidates, improved the performance about 4 times.
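An annotated reading of that loop (the code is the same as quoted above; only the comments are added here):
Code: | /* annotated restatement of the quoted loop -- names as in the original,
   comments added */
t1 = t2 = t3 = 0;
for (j = 0; j < 9; j++)
{   b  = G[PIdx].Grid[Group[i][j]];  /* candidate bitmask of one cell        */
    t2 = t2 | (t1 & b);              /* values already seen -> seen twice+   */
    t1 = t1 | b;                     /* values seen at least once            */
}
t3 = t2 ^ t1;                        /* values seen exactly once: each is a
                                        hidden single in this group          */
|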
Source in C++ and Windows binaries of my solver are available at (ah, these link limitations) "sites . google . com / site / dobrichev / fsss" |
|
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Wed Jan 06, 2010 4:00 am Post subject: |
|
|
Greetings,
First off, congratulations, preliminary tests are showing around a 15% to 20% improvement with your program.
There is always a bit of a problem with tweaking code for efficiency, since it is very compiler and environment (os / processor / hardware) dependent. I compiled and tested the improvements you recommended (compiled directly from the file you emailed me), and for me, it actually slightly increased the amount of time needed.
On the other hand, when I compiled your solver (fsss), it did run faster than mine, by some 15% to 20%. I will be looking through your code to see what some of the differences are. I do know I left some inefficiencies in my code as part of additional functionality (like handling different solving techniques and such), but that would not account for the difference.
bt |
|
|
|
| briturner
| Joined: 14 Oct 2008 | Posts: 75 | : | | Items |
|
Posted: Wed Jan 06, 2010 2:40 pm Post subject: |
|
|
Greetings again,
It is going to take a bit more work for me to get an apples-to-apples timing comparison here. Two things I noticed are:
a) you buffer the entire file IO before starting the timing, and
b) more importantly, it looks like you are attempting to run on multiple processors (I see things like "#pragma omp parallel for schedule(static, 1) reduction(+:nSolved)").
While these are legitimate speedups, they do make timing the actual solving algorithm more difficult.
I think I will try putting both solvers in the same timing stub, which will help give me better timings.
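Something like this rough sketch (solver_a / solver_b are just placeholders for the two solvers' entry points; all input is buffered before the timed loop):
Code: | /* rough sketch of a shared timing stub -- solver_a/solver_b are
   placeholders to be replaced by the two real solver entry points */
#include <stdio.h>
#include <time.h>

static void solver_a (const char *puzzle) { (void)puzzle; /* call solver A here */ }
static void solver_b (const char *puzzle) { (void)puzzle; /* call solver B here */ }

static double time_it (void (*solve)(const char *), char (*puz)[128], int n)
{
    clock_t t0 = clock();
    int i;
    for (i = 0; i < n; i++)
        solve(puz[i]);                  /* only the solving is timed */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main (void)
{
    static char puz[10000][128];
    int n = 0;
    while (n < 10000 && fgets(puz[n], sizeof puz[n], stdin))
        n++;                            /* buffer all puzzles before timing */

    printf("A: %.3f s\n", time_it(solver_a, puz, n));
    printf("B: %.3f s\n", time_it(solver_b, puz, n));
    return 0;
}
|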
bt |
|