Setting Sudoku for Human Solvers (Sudokorists)

Agent Allen

I'm working on a setting program the creates puzzles for humans to solve.
So I'm interested in 'hardness' measured in human terms.
A feature-list goes somewhere for novices and intermediates - I don't want to present puzzles that require 'advanced' techniques to non-experts.
However there is to my mind a nearly well defined measure of AESM rating:
Definition
Average Expert Sudokorist Minute Rating (AESM rating) is a human oriented measure of hardness for sudoku puzzles.
Given a large group of expert sudokorists, if their average time to solve a given puzzle is N minutes, then that puzzle has an AESM rating of N. It can also be described as an N AESM puzzle.
The experts should work under ‘standard examination conditions’ presented with one standard-sized printed copy of the puzzle which they are to ’solve in’ and an unlimited supply of pencils, erasers and paper.
If an individual should make a mistake and present an invalid solution as complete, they would return to working without further clue (except that they haven’t finished) and complete the puzzle without penalty. Their solution time would be the total time they take until presenting the valid solution.
The experts will know that a solution will exist and is unique but they don’t know who set it or what techniques were used in devising it.
Commentary
I have avoided a definition of “expert”.
We could use a relative ranking (e.g. experts are the top 20% of sudokorists) or a more objective definition (an expert is someone who knows all the good techniques and when to apply them and has a wealth of experience applying them effectively). In principle we should find a reasonable overlap between these definitions.
Notice that some effective techniques call for some form of SIAS (suck it and see) guesswork.
So variances in time between sudokorists will reflect both objective ability and subjective “luck on the day“. That is, notice that the definition is averaged across both ability and “luck on the day” within the (strictly notional) community of expert sudokorists.
So what are the best computer heuristics for generating puzzles within a pre-set range of AESM ratings?
I believe that we should find that AESM rating is somewhat invariant across the standard universal transformations for puzzles. So permuting the numbers 1 through 9 or the columns in the houses (AKA: blocks) or the rows in the houses or the actual columns of houses or the actual rows of houses or rotation of the puzzle should be given about the same AESM rating. Such changes shouldn’t really confound experts. That said, if we take the definition at complete face value we would expect to find experts using ‘is it 1, is it 2,…’ techniques and be able to transform puzzles to make them take the long route round (I.e. contrive to ‘make it 9‘). I’d also expect experts to show further structural (including cultural) bias between logically equivalent puzzles. In particular we‘d expect westerners to start scans at the top left and proceed right and down.
The method I’m working on generates a random solved puzzle then depletes cells rating the puzzle at each step until it finds a puzzle with only one solution and the required difficulty.
One obvious and well-known fact is that you can’t deplete back to uniqueness.
It would also appear you can’t deplete to an easier (lower AESM rating) puzzle. Its hard to believe experts can be confused by more givens.
I also believe that by this measure the easiest unsolved puzzles are all those puzzles with only one cell empty in the top left hand corner with a solution that is 1 thru 9 on the top row and 1 - 9 reading left to right and down in the top-left house.
I think this puzzle has an AESM rating of 0.05 (3 seconds) and that anything under an AESM rating of 2 (i.e. 2 minutes) should be considered trivial.
Obviously you can come up with AESM ranges for difficulty grades from "Trivial" to "Super Hard" or whatever words you like.
I’d love hear from anyone who has thoughts on:
1. Good ways of determining AESM but computation.
2. What is the highest AESM rating, which puzzles have it and why?
3. What are the most effective expert human techniques and how do experts select them?
4. Where can I obtain a baseline of puzzles and completion times to calibrate against?
5. How effective in evaluating AESM rating is weighting a feature list and which features to include?
6. What ‘feature counter’ algorithms have an appropriate level of invariance under transformation?
7. Do the universal transforms (above) actually have little affect of AESM rating? Do any others?
8. The best ways of obtaining a random solution.
The technique I am using is a form of “Ariadne’s drunken walk“. That is: fill blank cells up with 1-9 “candidates“. Pick a cell with the minimum population at random and try a candidate in it at random. Proceed with picking and trying at random recursively coming back and trying the remaining candidates if we get to an insoluable point.
I’m pretty sure that doesn’t produce an even distribution on the set of all completed sudokus.
I tried the form where the cell to attack is chosen at random, but found that the routine was very erratic in execution time.
Ideally I’d have an algorithm that returned a uniform distribution on all solved puzzles.
That is possible given that the space has been counted in a constructive brute force algorithm (http://www.shef.ac.uk/~pm1afj/sudoku/sudoku.pdf) we could in principle pick an element at random.
9. The best way of depleting to a puzzle.
My current approach is to create a random route across the puzzle and then try every cell blank and "given" in that route recursively until a puzzle in the desired difficulty range is found.
If I get to puzzles that are too hard or have more than one solution, I retreat. In principle I'm going to try all 2^m*m (where m is puzzle side) 'masks' to see if
Again, I’m sure this has weaknesses in terms of producing a uniformly random puzzle in the desired range of difficulty.
10. How is difficulty split between puzzle and solution?
Do all solutions have difficult puzzles or is it only some? If so, which and why?
_________________
Agent A

Lummox JR · Posted: Sun Oct 02, 2005 1:31 am Post subject:

dukuso · Posted: Sun Oct 02, 2005 5:46 am Post subject:

I assume that human experts would use almost the same method
as the fastest computer programs.
And that will probably include not much more than looking
for naked/hidden singles, and single unit candidate.
And T&E or systematic search with backtracking if the
above don't suffice.
And since asking many humans for their solution time
is infeasable, I rate them by 100 randomized
computer runs. (only naked/hidden singles).
Rating by the hardest technics needed has the disadvantage
that the easier technics could be hard to find.
E.g. if there are 10 turbofishs and any of these solves the puzzles
that's much easier than if you have to find all of them.

-Guenter.

Agent Allen · Posted: Wed Oct 05, 2005 9:50 pm Post subject:

Lummox Jr, thanks for the feedback.
you said:

dukuso · Posted: Thu Oct 06, 2005 5:21 am Post subject:

ï»¿Lummox Jr, thanks for the feedback.
>you said:
>
>Quote:
>
>I don't think this really makes a difference in sudoku...
>
>
>I agree. It was worth pointing that people probably are bias.
>I don't think a computational definition of 'complexity' or
> 'difficulty' should show bias. So Iâ€™m intending to be slightly
> easy going about bias in what I'm doing.
>
>Surely any useful computational definition should be actually
>invariant - not nearly invariant over those transforms.
>That was my little point.
>
>You said:
>
>Quote:
>
>In practice this is impossible. It's quite possible, in fact
>probable, to have a minimal puzzle â€¦
>which is startlingly easy, requiring only singles to complete.
>
>
>I agree. It appears lots of minimal puzzles are easy.
>The routine I coded 'back-tracks' and tries another pattern
>when it 'falls off' an easy path. It also back-tracks if it
>gets too tough.
>That method appears to usually come back with a puzzle of the
>required difficulty when the required difficulty isn't ridiculously hard.
>
>That's why I asked questions about whether a given solution has
>easy or hard puzzles or whether it was in the mask and why
>I pointed out the 2 obvious lemmas:
>It was worth pointing out that if you find duplicates your
>stuffed and if it gets too hard it ainâ€™t getting easier.
>I also wanted to point out that we would mostly agree that
>even for humans you can't make it easier by making it harder!
>Remember people are funny and tricks like "antidisestablishmentarianism
>is a long word. Spell "it" " work! So you can make things
>easier by providing less information sometimes! Humans have
>an innate dumbness and staggering ability to not behave as
>people expected!
>Again I'm trying to understand the human form of solving so
>its worth keeping an eye on the funniness of human behaviour
>rather than thinking of them as merely slow computers.

which I think they are.

>Which brings me on to dukuso's mail - thanks for this too.
>Dukuso said,
>Quote:
>
>>I assume that human experts would use almost the same method
>>as the fastest computer programs.
>
>
>We know that in general people and computers don't solve problems
>the same way.

that's only because programmers are lazy. They usually choose the easiest
way to implement something, not the fastest or most efficient.
Often there are some tricks which are easy to see for humans
but hard to implement.

>So it probably isn't effective to just multiply up the best
>computer solvers time to obtain human times. But if it was,
>Iâ€™d still like to calibrate the definition for humans.
>I want to make sure Iâ€™m using a model of the techniques
>people can actually do in their heads. Humans are notorious
>for having little 'working memory' and struggle to complete

they have an unlimited source of paper for memory.
I never heard about lack of enough paper as a problem in
solving sudokus.

>at all problems computers have in micro-seconds.
>You don't build linked lists in your head and people are very
>poor at performing Ariadneâ€™s thread backtracking without
>missing great big bits of the tree.

they could learn it. I have been told that the best human
sudoku-solvers do use backtracking. Even if you miss some
branches occasionally it could still be a good heuristics.

>Computers on the other hand have more working memory than
>you can shake a big stick at.

..which isn't being used for sudoku-programs. I think that the
best sudoku programs don't use much more memory than what
humans to with a list of candidates.

>Further, you need to take care that to grade a problem
>you might have to run it through several solvers or several
>â€œsolvingsâ€œ.

yes. Or through several humans, which is harder in practice.

>You see I think that if program X 'works from top left' and
>program Y 'works from the bottom right' on particular problems
>they might get very different timings.

yes. That's why I use 100 randomized runs for my ratings and
then take the average number of nodes needed to solve it.

>Over the space of all problems they might be even.
>However individual problem A might be easier from top left
>and harder from the bottom right - depending on your method -
>and at least show different timings.

same problem for humans.

>There's an item in another listing on this forum where someone
>posted a tough puzzle, rotated it and found it easy. So which
>is it hard or easy?
>The answer depends on the definitions of easy and hard in use.
>For the definition above I believe the answer may be some form
>of â€œmedium" because as Lummox Jr points out - we suspect people
>'wander around' the problem and some of them will find the easy way
>in and others take a longer route.
>I'm suggesting that any 'useful' objective definition of difficulty
>of puzzles as opposed to efficiency of programs will have the
>property that all the standard transforms of a puzzle with have
>the same difficulty.

agreed.

>That isn't a contradiction to the possibility that the fastest
>programs might not take the same time on all elements of all
>the equivalence classes.
>
>You point out that the best methods known resort to trial and error
>in the end, and those methods in particular will have this problem
>that X is a trivial transform of Y, but X takes less effort than Y
>for a given program.
>That's because good trial and error plays the averages over problems.
>I'm looking at an average over methods!
>
>You said,
>Quote:
>
>asking many humans for their solution time is infeasable
>
>
>I don't see why so much. There's plans for a national championship -
>http://www.timesonline.co.uk/article/0,,2-1611106,00.html.
>Surely we'll get some published data out of that to fit the bill.

you won't have humans competing with computers on issues like that !
It's still useful to compare human ratings with computer ratings -
but only for calibrating the computer programs.

>Anyway http://www.sudokufun.com/ publishes some statistics.
>I haven't directly accepted those in the definition, because
>the conditions are probably varied and unclear.
>But in the absence of great data, that data might give some help.
>
>If the competition rules of the championship don't match the
>environment set out definition, I'll change it.
>I don't care much about those parts of the definition, but want
>the conditions to be well-defined, reproducible and conducive to puzzling.
>
>Remember a definition can be useful without us having the means
>to directly observe it.
>Avagadro's number and light-year spring readily to mind as
>definitions we find useful but can't at this time directly measure.
>
>The environment thing is a bit like STP - easier to achieve than
>physically significant.
>_________________
>Agent A

we will have different conflicting measures for some time depending
on the technics involved in the program.
Mine only looks for naked single/hidden single which are the basic
constraints and thus of some theoretical interest.
I don't expect that this is the fastest method, though.
Take the fastest computer solver around, let it solve some thousand
equivalent transformations of a sudoku and take the total time.
That would be my favourite measure of sudoku-hardness.

I expect there will be good correlation with human defined measures.
Probably even better than correlation between different
schools of human solvers.

-Guenter

scourtaud · Posted: Wed Dec 14, 2005 12:21 pm Post subject:

Hi,

Today being the day I attempt do generate sudoku with PHP, I'm quite interested in the idea of a global Difficulty rating.

I can still see a problem with your approach :
A such global rating would be impossible to implement widely since you can't imagine having a panel of sudokurist solve all possible grids. This means that some grids will stay unrated (most of them I guess). The AESM would then be statisticaly inexistent regarding the number of rated grids against the number of possible grids.

Example :
I've got a book of 150 easy sudoku grids. Considering those grids are easy they might have an average rating of 2. this means that to obtain a valid rating on all of those, you need to have something like 100 experts with 10 Free hours (2 minute break after each grid for verification/coffee/smoke/phone/..) locked in a room.
This seems very unlikely to me and very expensive for 150 easy grids.

Maybe a website could do the trick but you then have no control on the result. Have they used a solver, if each grid isn't given at the same time to each expert, how can you be sure they don't pass along results?

I think it would be much more efficient and cost effective to use a pannel of solvers to see how long it takes for them to solve, what techniques to use and find a relation with what a human brain could experience. This may not be as accurate as AESM but it definitly is more feasable.

On top of that, the experts pannel may get you false results. They are experts. they could easily judge the quality of a grid but there is no way the time it would take for them to solve a given grid would give a rating that would correspond to the majority of sudoku players. If I see this has got the lowest rating and I find it difficult, I may not enjoy playing anymore and sudoku dies of making me feel stupid.

Take an example, With the follwing ratings :
Naken / Hidden singles : 1
Block/ block and block / col & row interactions : 2
Naked / hidden subsets : 3.

I can rely on this rating to give me a grid at my level. and even if the subsets are harder than normal, I understand the technique and I know it at my reach, I just need to work harder.

And if I want to progress on my sudoku solving techniques, I just have to learn new techniques, all I have to do is read about the next technique and try a puzzle of the next rating. Solve it and go on...

Hope you find this usefull...

Sebastien

donaldf · Posted: Mon Feb 06, 2006 10:26 pm Post subject:

Just a note from my perspective:

A puzzle rated by Guenthers algorithm is quite stringent and easily solved using basic techniques up to a rating of about 6000. This is so because a 6000 rated puzzle most often has 21 clues set, and one node is needed for each of the 60 digits that has to be entered. This also means that these puzzles do have naked or hidden singles for all configurations from start to end of the solving procedure. These puzzles are often percieved as solvable albeit not always easy to humans. (I have 15 employees that I terrorize every coffee break with new sudokus Smile

Puzzles with a rating over 6000 have less than 20 clues at times, and a puzzle with a rating 7000 for sure needs guesses (or candidate list management or backtracking methods) to be solved. Commonly, these puzzles are precieved as very hard to humans.

Another commonality is that the rating of Guenthers rater are convergent for puzzles with a rating of less than 6000, while puzzles with a rating of 7000 or more often diverges somewhat.

My point is that if we are going to generate Sudokus for ordinary sunday sudoku solvers, we better keep the rate below 7000. If we want to create nightmare Sudokus for the expertise, we never need to go over 10 000.
_________________
It's better to seek forgiveness than to ask permission

Agent Allen

The point I'm trying to make is that a rating would be more useful if it could be identified as an estimater of some useful value.
That value may be hypothetical and indeed a statistic in itself.
Counting features is a very useful thing to do and if your setting for limited ability necessary.
However placing arbitrary numbers next to those features and toting them up
isn't likely to be leading you to a useful objective statistic of the puzzle.

The panel is a thought experiment, it may never exist, it doesn't have to for me to write a program that I believe is estimating how long that panel would on average take.

My problem with the arbitrary scales is a bit like pawn=1,bishop=3,queen=10, rook=5, etc. for chess.
It gets novices in to the game, but isn't a professional tool because it is appallingly crude.
You're certainly an absolute novice if you don't use position AT ALL!

So I was driving for ideas on how humans might actually solve puzzles and how we might estimate the time it might take them.

The method I'm been developing includes an 'cost' relating to the number of 'can't fill' spaces at each stage.
That's intended to account for 'hunt' costs which for the simpler end of the market have a great influence on difficulty.
If it is all hidden/naked singles which is what pro setters appear to give the newbies, difficulty greatly depends on how many opening moves there are.
It seems for toughies the same thing applies to an extent.

I've seen suggestions around the we should incorporate that, but haven't seen any clear designs.
My experimental model is:

1. Find the set of all the Hidden Singles (H in size) in a puzzle of size L with K known.
2. Introduce a factor for L- H -K. to represent the effort spent on fruitless inspections.
NB: L is the total number of cells. 81 for yer standard grid.
3. Fill in ALL the hidden singles just found.
4. Go again. If hidden singles runs out, try something stronger.
5. If you try a method and get nothing, pay a penalty.
6. After each 'stronger' method slink straight back to hidden singles because they're cheap.
7. If you come to the end either declare it 'too hard' or use trial and error.
If you use trial and error, use a minimally pivot and evaluate all cases - not just until you find the answer.

In an ideal world it might try each hidden single in turn to see what it unlocks and so on, because your chances of the next one depends in part on how many your last move unlocked.
Unfortunately the number of branches is beyond our science - 10^100 kind of numbers and there don't appear to be 'flattening' transformations.

The challenge is to then estimate a 'column' of methods with costs in seconds that reflect your audience.
_________________
Agent A