About filled grid generators

anttiahola · Posted: Mon Sep 11, 2006 7:48 pm Post subject: About filled grid generators

While I was developing my filled grid generator I started wondering if someone has made an analysis on the generator he/she has implemented. I mean that a good generator should be able to follow two simple rules:

1. It should be able to generate all possible sudoku grids
2. If one has a perfect random number source, then each of the grids is generated with equal probability

I'm quite convinced (= I haven't proven this) that my generator is able to generate all possible grids (rule 1), but it does _not_ generate all the grids with equal probability (rule 2), which is a shame.

So, I would like to hear that does anybody think that your generator obeys both of the rules (even informal "proof" would be nice). And if so, would you like to share the algorithm?

Another question that interests me is that do you think that it is possible to create an algorithm that

- Obeys rule 1 and 2
- Does not use recursion / backtracking
- succeeds always to create a valid grid

I'm curious about this because my generator doesn't use recursion. So when it realizes that the grid that it is generating has a dead end (= It's not able to produce a decent grid), it just starts over from empty grid. Now, the gain from my algorithm is ~85%. So 15 times out of 100 it is not able to produce a valid grid. I'd like to know if someone has an algorithm that is always able to create a valid grid (without backtracking).

Thanks,
Antti

m_b_metcalf · Posted: Tue Sep 12, 2006 12:40 pm Post subject:

Well, a number of formal points:

Rule 1. We read in
http://en.wikipedia.org/wiki/Sudoku#Mathematics_of_Sudoku that there are 6,670,903,752,021,072,936,960 possible solution grids, so if you can generate one per second, it will take you 200 trillion years to find each one once.

Rule 2. Do you really have access to a perfect random number generator? All the ones I know of are
pseudo-random generators, i.e. are deterministic and cycle. A true random number generator, such as that used in the UK to select Premium Bond winners, is based on a physical procedure, such as electical sparking, that is non-deterministic.

That said, I have a number of sudoku generators, one of which is based on a pseudo-random number generator that builds the grid a row at a time (see http://delivery.acm.org/10.1145/1130000/1124709/p4-metcalf.pdf?key1=1124709&key2=2504608511&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618 for details.) This, like yours, has to restart, at least partially, in about 10% of cases (I haven't measured that exactly, just an impression).

I'm curious to know how you can tell that your generator does not generate all the grids with equal probability.

Regards,

Mike Metcalf

anttiahola · Posted: Tue Sep 12, 2006 2:50 pm Post subject:

Well, the rules 1 & 2 are a bit academic. It's the same thing as when writing a poker game. Even if I'm using a pseudorandom number generator, I still prefer that the algorithm is theoretically able to obey the rules 1 & 2.

The proof that my algorithm does not produce all the grids with equal probability is accomplished simply by finding a pair of example grids that are not equally likely. This can easily be done programmatically.

Example:

When my algorithm is in a middle of generating a grid, it comes to a point when it needs to use random number generator to decide what number to put in a certain position. Lets assume that there are two alternatives, namely A and B. Now both of these alternatives has a 50% chance to be chosen. Now if the algorithm chooses path A, it finds that it is able to generate a maximum of 6 different boards (equally probable). If it chooses B, it can generate only 4 different boards (again equally probable). Now it is obvious that the probability to create any of the boards after choosing B is bigger than creating any of the boards after choosing A.

Proving that your algorithm does obey the rule 2 is usually much harder.

Ruud · Site Admin

The bias is only present at the moment you have to choose between 2 candidates within a single constraint. It has no effect on the chance for any of the grids to be generated, because a different order may give another result. The order itself being random eliminates the bias.
Only when you always start with the same cell, you will create a bias.

You can verify this by using the complete candidate space.

Keep a collection of ALL 729 candidates, and give each of them a random number. Sort the list by this random number.

Take the first candidate in the collection and place it in the cell. Check for immediate eliminations and remove the eliminated candidates from the collection. Check for singles and place them, clearing further candidates.

Repeat this until you get stuck - start over - or reach a solution.

Ruud
_________________
Meet me at sudocue.net

anttiahola · Posted: Tue Sep 12, 2006 5:36 pm Post subject:

Mauricio · Posted: Tue Sep 12, 2006 7:22 pm Post subject:

anttiahola · Posted: Wed Sep 13, 2006 5:38 am Post subject:

Mauricio · Posted: Wed Sep 13, 2006 5:33 pm Post subject:

Well, after thinking about your questions, I think that it is very difficult to do a efficient generator that obeys your rule 2; though I found a generator that obeys your rule 2:
1. Fill ALL the cells of the sudoku randomly (filling each cell with a number between 1 and 9 with the same probability).
2. Check if the sudoku is valid.

Obviously that algorithm generates all possible sudokus and does it with the same probability each, and obviously it is VERY inefficient.

After thinking a little, I am convinced that my generator does not obey your rule 2.

I generate a filled sudoku grid using DLX, with a little modification so that generates a random grid, and I succes to generate a random grid for each one of the 324!x(9!)^324 possible inputs (I permute the order that I link the 324 columns randomly and then I permute the order that I link the columns, 324 times (once for each column)). But we must note that the number of total sudokus (6,670,903,752,021,072,936,960) does not divide 324!x(9!)^324, so my algoritm can not be unbiased.

Regards.
Mauricio.

The Ostrich

Hi,

This sort of problem is very common in tree and graph algorithms. The problem in question is similar to inverse julia set rendering (if you're into fractals at all). Basically, think of all possible partial and completed sudoku grids as a tree, with solutions at the leaves. If some parts of the tree are denser than others (which they will inevitably be), then choosing randomly between candidates results in a non-random distribution of leaves.

My feeling is that this can be fixed iteratively by an algorithm which keeps track of which parts of the tree have been visited in all previous traversals. Suppose we flag nodes with estimates of the number of solutions above each.

If we find a node where one or more branches have never been visited, we choose randomly among the unvisited branches.

If we find a node where all branches have been visited, we choose randomly, weighted by the estimate of the number of solutions along each branch (ie the stored estimate of the number of solutions in the next node in that direction).

So how do we generate the estimates? Here is a stable way, which converges on the exact number as the search progresses:

1. When we find a solution, we mark it as a leaf, and set #solutions = 1.
2. We then backtrack through the nodes that lead to that leaf. For each node, we set #solutions = A + B

where:

A = (sum of #solutions of all visited nodes above the current node)
B = (number of unvisited nodes above this node) * A / (number of visited nodes above this node)

The information about nodes in the tree from all previous searches can be stored in a hash table, where nodes are uniquely identified by nonary routing codes.

We need one addition to eliminate bias. Each node needs to store a second number, which is the number of solutions above that node already visited. Instead of using the pure #solutions in making our routing decisions, we instead need to use (#solutions estimated - #solutions found). This has the effect of modifying the algorithm so that it can't "see" any part of the tree which has previously been explored by the algorithm - i.e. changes the tree we're traversing to the tree of sudoku grids yet to be traversed.

The algorithm has only one non-random bias I can see, which is that it will not ever produce the same solution grid twice until all solution grids have been produced. This can be fixed (because we know the exact number of solution grids) by storing all previously found solution grids, and randomly choosing at the beginning whether to return an already found grid, or run the algorithm.

Can anyone see why this wouldn't work, or any biases that I haven't spotted?
_________________
Download Sudoku and Killer Sudoku!

Sudoku Tiger, the revolutionary new sudoku and killer sudoku program is now available. Download it at www.sudokutiger.com