Sudoku Programmers Forum Index

 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister   ProfileProfile   Log inLog in          Games  Calendar

Log in to check your private messagesLog in to check your private messages   

recognizing text grids
Goto page 1, 2  Next
 
Post new topic   Reply to topic    Sudoku Programmers Forum Index -> Programming sudoku
View previous topic :: View next topic  
Author Message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Fri Jun 17, 2005 3:19 pm    Post subject: recognizing text grids Reply with quote

This is a thread to discuss techniques on how to recognize text-based sudoku grids. The topic was raised in the Pappocom forum. The attached post got the ball rolling. I'll have my own two-cents to add in a day or two.

Animator wrote:
scrose wrote:
Thank you, even pseduo-code would be greatly appreciated. The day before I discovered your program, I was contemplating how to write code to recognize a variety of text-based sudokus. And as with most ideas, I discovered someone had already done it!

A bit an off-topic reply, but here is it anyway: regular expressions are definitely an option... When a grid is pasted on this forum I usually use 2, 3 or 4 regexes to convert them to the format I'm used to...

The only bad thing about it at the moment is that I rewrite the regexes for each puzzle... But that's mainly due to the fact that they are easy and that the language I use allows them to be used really easily...

A small example of the command I run (assuming input: 123 456 ..., and output: 1 2 3 | 4 5 6 | * * * ):

Code:
perl -pi -e 's/ /|/g;s/\./*/g;s/./$& /g;' filename
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Fri Jun 17, 2005 5:06 pm    Post subject: Reply with quote

Here is a list of programs (that I have encountered and used) that perform input recognition of text sudoku puzzles.

simes
rubylips - allows one format only
angusj

Please mention any other programs that also perform input recognition and I will update this list.
Back to top
View user's profile Send private message
Animator

Joined: 26 Apr 2005
Posts: 18
:

Items
PostPosted: Fri Jun 17, 2005 6:04 pm    Post subject: Reply with quote

I think it would be more intresting to see a list of different 'styles'/grids that are used... (it would allow a solution to be tested)

Making something that will work all the time sure is a complex thing... the main problem is how do you identify the charachter that is used to indicate an empty cell?

Making something that works 90% of the time should be slightly easier I guess...
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Fri Jun 17, 2005 6:25 pm    Post subject: Reply with quote

The format native to rubylips' java app.

Code:
 . 2 1 | 9 3 . | . . .
 . . 5 | . . 7 | . . 2
 . . 7 | . 6 . | 8 4 1
-------+-------+------
 . 4 . | 5 . 6 | . . 8
 7 . 6 | . 1 . | 2 . 3
 5 . . | 3 . 2 | . 9 .
-------+-------+------
 3 8 9 | . 2 . | 4 . .
 6 . . | 7 . . | 1 . .
 . . . | . 5 9 | 3 2 .

The format native to angusj's app.

Code:
4XXX91XX8
XXX2X8XXX
X1XX53X6X
XX31X79XX
XX1XX97XX
XX75X23XX
X64X1582X
1XX9X46XX
5XXX261X4

Here are some examples I have pulled from the players' forum.

Asterisks seem to be a popular empty-cell placeholder. Note the different methods for differentiating blocks.

Code:
* 3 4 * * * * 9 *
* 1 * 9 * 7 3 * *
9 * * * 3 * * * 1

3 8 7 4 1 6 9 2 5
* 2 * 8 7 9 6 4 3
6 4 9 3 5 2 1 7 8

4 * * * * * * 3 *
* * 8 7 * 3 * 1 *
* 9 3 * * * 2 8 *


4** *91 **8
*** 2*8 ***
*1* *53 *6*

**3 1*7 9**
**1 **9 7**
**7 5*2 3**

*64 *15 82*
1** 9*4 6**
5** *26 1*4


8**|**1|*3*
**1|8**|**4
***|9**|*7*
-----------
*5*|**9|6**
*4*|***|*5*
**9|7**|*4*
-----------
*8*|**2|***
3**|**5|9**
*2*|1**|**5

The following example was not posted in a fixed-width font, which is why the horizontal lines are wider than the puzzle.

Code:
----------------
|x42|63x|xx5|
|xx5|27x|xxx|
|xxx|954|x28|
----------------
|19x|862|x5x|
|xx3|145|2x9|
|52x|397|xx6|
----------------
|759|416|xx2|
|231|589|467|
|4xx|723|591|
----------------

And I came across this monster that uses spaces as empty-cell placeholders. I suspect something like this might be a challenge to accomodate.

Code:
#####################################
# 2 | 9 | 5 # 7 |   |   # 8 | 6 |   #
#---+---+---#---+---+---#---+---+---#
#   | 3 | 1 # 8 | 6 | 5 # 9 | 2 |   #
#---+---+---#---+---+---#---+---+---#
# 8 |   | 6 #   |   |   #   |   |   #
#####################################
#   |   | 7 #   | 5 |   #   |   | 6 #
#---+---+---#---+---+---#---+---+---#
#   |   |   # 3 | 8 | 7 #   |   |   #
#---+---+---#---+---+---#---+---+---#
# 5 |   |   #   | 1 | 6 # 7 |   |   #
#####################################
#   |   |   # 5 |   |   # 1 |   | 9 #
#---+---+---#---+---+---#---+---+---#
#   | 2 |   # 6 |   |   # 3 | 5 |   #
#---+---+---#---+---+---#---+---+---#
#   | 5 | 4 #   |   | 8 # 6 | 7 | 2 #
#####################################

Here is an example copy-and-pasted from Excel. I have used \t to indicate tabs.

Code:
\t\t4\t8\t7\t5\t\t\t
5\t3\t1\t\t\t\t8\t\t
\t7\t\t\t\t\t2\t4\t
7\t\t\t5\t\t1\t\t\t2
8\t\t\t\t\t\t\t\t3
4\t\t\t6\t\t2\t\t\t9
\t2\t6\t\t\t\t\t7\t
\t\t7\t\t\t\t5\t1\t8
\t\t\t7\t9\t4\t3\t\t
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Fri Jun 17, 2005 8:21 pm    Post subject: Reply with quote

Here is a really simple list of tasks that a recognition script needs to accomplish.
  1. Get your input and validate it. Less than nine lines or input without any digits are pretty good indications that the input is faulty.
  2. Eliminate the lines which do not contain any digits. These lines are likely spacers between rows and/or blocks. However, be careful not to eliminate rows of the puzzle that naturally do not contain any digits.
  3. Focusing on the lines that contain digits, try and figure out what character is being used as a placeholder for an empty cell. (As Animator pointed out, this is the key step.) Once this character is determined, all other non-digit characters (plus the digit zero) can be removed.
  4. Take what you have left and plop it into some sort of data structure.
Here is a quick stab at some Perl code that incorporates steps 1 and 2. The data is read into @rawinput which is then quickly checked for at least nine lines and at least one digit. @rawinput is then examined for lines that contain no digits. These lines (and their location in the puzzle) are copied into %nodigits. Finally, a quick calulation is done to determine how many lines are acting as spacers and how many lines are rows of the puzzle that happen to have no digits.

Edit: Removed my code since it is more-or-less duplicated below.


Last edited by scrose on Tue Jun 21, 2005 2:49 pm; edited 1 time in total
Back to top
View user's profile Send private message
angusj
Site Admin
Joined: 18 Jun 2005
Posts: 406
:

Items
PostPosted: Sat Jun 18, 2005 4:19 am    Post subject: Reply with quote

scrose wrote:
The format native to angusj's app.

Although my Simple Sudoku stores puzzles in the format you showed (together with other info detailing all solving steps taken up to the point of saving), it can still parse just about all the usual formats used in the Sudoku.com forums. However, formats which use spaces for unsolved cells do cause problems, but fortunately most posters avoid that format style. The tabbed output style you've shown I haven't seen before (and I doubt it'll catch on as it's very hard to visualize) and I'm unlikely to program Simple Sudoku to recognise that format.

scrose wrote:
2. Eliminate the lines which do not contain any digits. These lines are likely spacers between rows and/or blocks. However, be careful not to eliminate rows of the puzzle that naturally do not contain any digits.

This seems like a contradiction to me Smile.

Here's some pseudocode to show how Simple Sudoku recognises different formats:
Code:

1. Copy plain text from clipboard and store it in an array of lines
2. Trim blank lines ... then if (lineCount < 9) quit.
3. If (> 11 lines AND top line contains no digits) {delete top line}
4. If (> 10 lines) {
      if (the middle lines 4 & 8 contain no digits) {delete middle lines}
      else {quit}
    }
5. for each line ... If (linelength = 13) {delete first and last char}

6. Strip spaces from lines. I thought long and hard about this step but finally decided it was worth avoiding the complexity of the alternatives.
7. If (lineLength == 11) {
      if (the middle chars 4 & 8 are not digits) {delete middle chars}
      else {quit}
    }
8. If (lineLength != 11) {quit}

Finally, the top 9 lines will have 9 chars.
Treat any non-digit chars in these lines as blanks.
Back to top
View user's profile Send private message Visit poster's website
Simes

Joined: 08 Apr 2005
Posts: 71
:
Location: North Yorkshire, UK

Items
PostPosted: Sat Jun 18, 2005 9:05 am    Post subject: Reply with quote

require a minimum of 9 lines? How about requiring a minimum of 81 characters?

I've seen a solver somewhere that used a single 81 digit line.
_________________
Simes
www.sadmansoftware.com/sudoku
Back to top
View user's profile Send private message Visit poster's website
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Sat Jun 18, 2005 11:36 am    Post subject: Reply with quote

Another example.

Code:
030200781
008000304
074030960
000180607
017609038
800375149
041060890
700000410
980413070
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Sat Jun 18, 2005 11:40 am    Post subject: Reply with quote

Simes wrote:
require a minimum of 9 lines? How about requiring a minimum of 81 characters?

Excellent requirement. How I missed that...

Simes wrote:
I've seen a solver somewhere that used a single 81 digit line.

I think it might be Pete Wake's solver that you had seen. Here is an example of the format he uses.

Code:
_2_______+___6____3+_74_8____+_____3__2+_8__4__1_+6__5_____+____1_78_+5____9___+_______4_
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Sat Jun 18, 2005 12:01 pm    Post subject: Reply with quote

angusj wrote:
The tabbed output style you've shown I haven't seen before (and I doubt it'll catch on as it's very hard to visualize) and I'm unlikely to program Simple Sudoku to recognise that format.

I agree, it's unlikely to see that tabbed format displayed on a forum, just as I have only occasionally seen Pete Wake's format (above) posted in messages. It all depends on how robust one wants their recognition code to be. Whitespace, of any sort, can be tricky to deal with.

angusj wrote:
scrose wrote:
2. Eliminate the lines which do not contain any digits. These lines are likely spacers between rows and/or blocks. However, be careful not to eliminate rows of the puzzle that naturally do not contain any digits.

This seems like a contradiction to me.

Sorry, I wasn't clear with what I was trying to state. Consider the following two examples.

Code:
 7 . . | . . 4 | 1 . 2
 . 4 3 | . . 7 | . 9 .
 . . 5 | . 3 . | . . 6
-------+-------+-------
 . . . | . . 6 | . 5 4
 . . . | . . . | . . .
 3 5 . | 1 . . | . . .
-------+-------+-------
 5 . . | . 4 . | 2 . .
 . 8 . | 9 . . | 5 7 .
 1 . 6 | 3 . . | . . 9


 1 . . | . 3 7 | 5 9 .
 2 . 8 | 5 . . | 1 . .
 . . . | . . . | . . .
-------+-------+-------
 . . 4 | . 7 . | . 6 .
 . . 5 | 3 6 1 | 2 . .
 . 1 . | . 2 . | 3 . .
-------+-------+-------
 . . . | . . . | . . .
 . . 9 | . . 3 | 8 . 2
 . 7 2 | 6 1 . | . . 9

The top puzzle contains three lines that contain no digits, however one of those lines is row 4 of the puzzle. The bottom puzzle contains four lines that contain no digits, however two of those lines are rows 3 and 6 of the puzzle. I was trying to suggest to watch for these special cases when eliminating lines that do not contain digits.

Also, consider a puzzle where rows 1 and 9 contain no digits. Recognition code would need to be wary of considering these lines as extra padding. This example is from the Guardian on June 10.

Code:
 . . . | . . . | . . .
 . 6 8 | 4 . 1 | . 7 .
 . . . | . 8 5 | . 3 .
-------+-------+-------
 . 2 6 | 8 . 9 | . 4 .
 . . 7 | . . . | 9 . .
 . 5 . | 1 . 6 | 3 2 .
-------+-------+-------
 . 4 . | 6 1 . | . . .
 . 3 . | 2 . 7 | 6 9 .
 . . . | . . . | . . .

Thank you very much for posting your pseudo-code! I'll have some time to digest it further later today.
Back to top
View user's profile Send private message
Animator

Joined: 26 Apr 2005
Posts: 18
:

Items
PostPosted: Sat Jun 18, 2005 5:01 pm    Post subject: Reply with quote

Here are the steps that I would use:

1) reconisze the symbol used for empty cells.
2) remove all symbols except the digits and the symbol ere is a really simple list of tasks that a recognition script needs to accomplish (and perhaps newlines/line-endings too).
3) verify the symbol. The count of the numbers + the count of the symbol must equal 9 (for each row, column, box) (or whatever the size of the grid is)
4) Start parsing the grid. (Input now is something like: 1*34*678*\n*56789**3\n*****7***\n123456789...)

The main problem ofcourse is point 1.
point 2 is a problem when the symbol used for empty cells is the same as a seperator in the grid. (space or tabs for example)
Back to top
View user's profile Send private message
Simes

Joined: 08 Apr 2005
Posts: 71
:
Location: North Yorkshire, UK

Items
PostPosted: Sun Jun 19, 2005 9:04 am    Post subject: Reply with quote

Here's my code.

It takes a string, removes line breaks and any other characters I've seen used only for layout, replaces characters usually used for empty cells with zero. It then decides whether spaces and asterisks should be zero or layout depending on whether it's already got any.

Seems to work!

Code:
function TSudokuBoard.TidyText(const value: string): string;
begin
  result := value;
  result := StringReplace(result, #13, '', [rfReplaceAll]);
  result := StringReplace(result, #10, '', [rfReplaceAll]);
  result := StringReplace(result, '|', '', [rfReplaceAll]);
  result := StringReplace(result, '!', '', [rfReplaceAll]);
  result := StringReplace(result, '+', '', [rfReplaceAll]);
  result := StringReplace(result, '-', '', [rfReplaceAll]);
  result := StringReplace(result, '=', '', [rfReplaceAll]);
  result := StringReplace(result, '#', '', [rfReplaceAll]);
  result := StringReplace(result, '?', '0', [rfReplaceAll]);
  result := StringReplace(result, '.', '0', [rfReplaceAll]);
  result := StringReplace(result, 'X', '0', [rfReplaceAll]);
  result := StringReplace(result, 'x', '0', [rfReplaceAll]);
  if Pos('0', result) > 0 then
    result := StringReplace(result, '*', '', [rfReplaceAll])
  else
    result := StringReplace(result, '*', '0', [rfReplaceAll]);
  if Pos('0', result) > 0 then
    result := StringReplace(result, ' ', '', [rfReplaceAll])
  else
    result := StringReplace(result, ' ', '0', [rfReplaceAll]);
end;

_________________
Simes
www.sadmansoftware.com/sudoku
Back to top
View user's profile Send private message Visit poster's website
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Mon Jun 20, 2005 7:50 pm    Post subject: Reply with quote

I've banged out a rough bit of code that doesn't try to figure out what character is being used to represent the empty cell. Instead, it determines the coordinates of the digits in the text input. Using the x and y coordinates, the rows and columns respectively can be deduced. My code below scans the input, extracts the digits, and prints a 9x9 grid with periods to represent the empty cells. It doesn't quite work if the puzzle itself has digitless rows or columns. And it doesn't quite work with the tabbed Excel example. It does, however, successfully tackle grids that use spaces to represent empty cells. I'll try refining this to catch the exceptions.

Edit: Removed my code since it is more-or-less duplicated below.


Last edited by scrose on Tue Jun 21, 2005 12:58 am; edited 2 times in total
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Mon Jun 20, 2005 8:19 pm    Post subject: Reply with quote

Another example.

Code:
- - 3 - - - 7 - -
- - 5 - 2 1 - 4 -
- - 6 - 7 8 - 9 -
4 - - 2 - 6 - - 7
9 5 - - - - - 8 6
7 - - 5 - 9 - - 1
- 2 - 8 4 - 5 - -
- 1 - 7 9 - 8 - -
- - 9 - - - 3 - -
Back to top
View user's profile Send private message
scrose

Joined: 01 Jun 2005
Posts: 23
:

Items
PostPosted: Tue Jun 21, 2005 12:54 am    Post subject: Reply with quote

I tinkered with my code a bit and it now successfully recognises all of the examples shown above, except for the tabbed Excel and Pete Wake formats. As long as each of the nine rows and columns has at least once digit, I can recognise the grid without having to determine what character is being used to represent the empty cell. However, if the puzzle happens to have at least one row and/or column that doesn't contain any digits, I look for a character that satisfies the following condition: (number of occurrences of the digits 1 through 9) + (number of occurrences of a particular character) = 81. Because of the simplicity of this condition, it is easy to cook up some grids that cause my code to fail. Here are two examples.

Code:
 8 - - | - - 1 | - 3 -
 - - 1 | 8 - - | - - 4
 - - - | 9 - - | - 7 -
-------+-------+-------
 - 5 - | - - 9 | 6 - -
 - 4 - | - - - | - 5 -
 - - 9 | 7 - - | - 4 -
-------+-------+-------
 - 8 - | - - 2 | - - -
 3 - - | - - 5 | 9 - -
 - 2 - | 1 - - | - - 5


 8     |     1 |   3 
     1 | 8     |     4
       | 9     |   7 
-------+-------+-------
   5   |     9 | 6   
   4   |       |   5 
     9 | 7     |   4 
-------+-------+-------
   8   |     2 |     
 3     |     5 | 9   
   2   | 1     |     5

But since these seem like rather severe cases, I am content to let them slide for now. Here is my Perl code. When I have a little extra time, I may add a couple special cases to handle the tabbed Excel and Pete Wake formats.

Code:
#!/usr/bin/perl -w
# 3x3 Sudoku Text Grid Recognition v0.4
use strict;

# Check the command-line arguments for a filename.
my $filename  = (exists($ARGV[0])) ? $ARGV[0] : "";

# @rawinput (array) holds each line of the original unaltered input.
# %grid (hash of hashes) holds the digits found in the puzzle. The keys of the
#    primary hash are the line numbers of the original input. The keys of the
#    secondary hashes are the column numbers of the original input. The values
#    of the secondary hashes are the digits.
# %cols (hash) calculates how many digits are found in each digit-bearing column
#    of the original input. The keys are the column numbers of the original
#    input. The values are the number of occurrences of digits in a given column.
# %emptycell (hash) holds the varieties and occurrences of non-digit characters
#    in the original input. The keys are the different non-digit characters. The
#    values are the number of occurrences of a given character.
# $line (integer) indicates the line number of the original input.
# $digits (integer) calculates the number of digits in the original input.
# $emptycell (character) is the character that is used to represent an empty cell.
my @rawinput  = ();
my %grid      = ();
my %cols      = ();
my %emptycell = ();
my $line      = 0;
my $digits    = 0;
my $emptycell = "";

# Make sure the input file exists.
until ((-e $filename) || (-e ($filename .= ".txt")))
{
  print("Filename? ");
  chomp($filename = <STDIN>);
}

# Transfer the contents of the input file into an array. Append a single
# character (a new-line seemed like the best choice) to the end of the input;
# later, this allows an array split to work properly on the case where the
# puzzle contains less than nine rows and/or columns of digits and the very last
# character is the empty cell character.
open(INPUT, $filename);
@rawinput = (<INPUT>, "\n");
close(INPUT);

# Warn the user and terminate if the input is invalid.
die("At least nine lines of input are required") if (scalar(@rawinput) < 9);
die("Your input did not contain any digits") unless (join("", @rawinput) =~ /[1-9]/);

# Examine each character of each line of the input. If the character is not 1
# through 9, increment an occurrence of that character in the %emptycell hash.
# Otherwise, place the digit in the %grid hash using as keys the line and column
# numbers it was found at, in the %cols hash increment an occurence of a digit
# occuring in the given column number, and increment the $digits counter.
for $line (0 .. $#rawinput)
{
  my @chars = split(//, $rawinput[$line]);
  for (0 .. $#chars)
  {
    if ($chars[$_] !~ /[1-9]/)
    {
      $emptycell{$chars[$_]}++;
    }
    else
    {
      $grid{$line}{$_} = $chars[$_];
      $cols{$_}++;
      $digits++;
    }
  }
}

# If the puzzle has less than nine rows and/or columns with digits, the input
# must be examined to determine the character that represents the empty cell.
# A search is done on the different non-digit characters (in the %emptycell hash)
# to find one whose occurrences plus the number of digits in the puzzle
# ($digits) total to 81. If such a character cannot be found, inform the user
# that we have failed miserably, and terminate. Otherwise, once more, examine
# each character of each line of the input. If the character matches the empty
# cell character, place a period in the %grid hash using as keys the line and
# column numbers the empty cell was found at, and in the %cols hash increment
# an occurrence of a cell occurring in the given column number.
if ((scalar(keys(%grid)) < 9) || (scalar(keys(%cols)) < 9))
{
  foreach (keys(%emptycell))
  {
    $emptycell = "\\" . $_ if ($emptycell{$_} + $digits == 81);
  }
  die("Cannot determine empty cell character") if ($emptycell eq "");
  for $line (0 ..  $#rawinput)
  {
    my @chars = split(//, $rawinput[$line]);
    for (0 .. $#chars)
    {
      if ($chars[$_] =~ /$emptycell/)
      {
        $grid{$line}{$_} = ".";
        $cols{$_}++;
      }
    }
  }
}

# If the puzzle has more than nine rows and/or columns with digits, inform the
# user of the anomaly and terminate.
elsif ((scalar(keys(%grid)) > 9) || (scalar(keys(%cols)) > 9))
{
  die("Found extra rows and/or columns");
}

# Print the grid. If the puzzle has exactly nine rows and nine columns
# containing digits, periods need to be printed to represent the empty cells.
# Otherwise, periods have already been placed into the %grid hash to explicitly
# define rows and/or columns that do not contain any digits.
foreach $line (sort({$a <=> $b} keys(%grid)))
{
  print((exists($grid{$line}{$_})) ? $grid{$line}{$_} : ".") foreach (sort({$a <=> $b} keys(%cols)));
  print("\n");
}

Interestingly, this script is able to handle the following example. Please post any examples you think will cause my script to fail.

Code:
6   .1.  .. .78
.   ..2  13 .4.

.   ...  6. .1.
.   .29  .. 4..


8   .41  .6 9.3
.   .9.  .2 5..
.   1..  5. ...
.   4.3  87 ...




7   3..  .. 8.6

Edit: Added comments and small improvements to my script.


Last edited by scrose on Tue Jun 21, 2005 2:45 pm; edited 1 time in total
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    Sudoku Programmers Forum Index -> Programming sudoku All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
Sudoku Programmers topic RSS feed 


Powered by phpBB © 2001, 2005 phpBB Group

Igloo Theme Version 1.0 :: Created By: Andrew Charron