To me, this answers the question "How well can neural networks crack sudoku if we forget all of our domain knowledge, and just try to brute force our way through with insufficient training data and a neural network that's poorly sized?".
I suggest an alternate approach that would use a training set consisting of real-world data of sudoku puzzles that are in the process of being solved: before a box is filled in, and then after it's filled in. This would teach the network to solve the puzzle one step at a time, instead of all at once.
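A minimal sketch of how such before/after training pairs might be generated from an existing (puzzle, solution) pair — note the fill order here is random, which is my assumption; traces from real solvers (filling the most constrained cell first, say) would be closer to the real-world data I'm describing:

```python
import random

def step_pairs(puzzle, solution):
    """Given a puzzle and its full solution (both assumed to be 81-char
    strings, '0' for blank), yield (before, after) training pairs where
    'after' has exactly one more cell filled in than 'before'.
    The fill order is random here; real solver traces would be better."""
    blanks = [i for i, c in enumerate(puzzle) if c == '0']
    random.shuffle(blanks)
    grid = list(puzzle)
    for i in blanks:
        before = ''.join(grid)
        grid[i] = solution[i]
        yield before, ''.join(grid)
```

Each pair differs in exactly one cell, and chaining the pairs walks from the original puzzle to the full solution, so one (puzzle, solution) pair yields dozens of training examples.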
If you insist on using only the as-received puzzle and its solution for training, my intuition is that you'll need more than 10 layers in the NN, and much, much more data.
A taxonomy of what complex solving strategies need to be learned already exists: see e.g. http://www.sudokuwiki.org/y_wing_strategy and the categories "basic", "tough", "diabolical", and "extreme". My guess is that the NN (using the original author's strategy, but with more layers and lots more data) will eventually do well on "basic" and "tough", but will start to miss some of the harder categories depending on the training set and how the network is set up.
To me, it's an interesting question to ask how well NNs can perform without hand-holding, i.e. can they figure out all of the complex strategies that humans have figured out, starting from a blank slate?
Cf. AlphaGo, which invented new strategies that human players found interesting.
Of course, if you're training your NN to solve a real-world problem, you wouldn't choose to train it this way; you would do as you suggested and feed it as much relevant training data as possible, to give it a head start.
(Your points re: sizing and quantity of training data sound reasonable and are outside my expertise, so nothing to add there.)
Yeah, yeah, sudoku is easy for computers. That's not the point. The point is learning about neural networks. So yeah, "adding a neural net" does help with that.
In which case the hard part is finding a suitable problem; you might be better off learning about SMT solvers, because you don't have Mark Cuban telling everybody to do it.
I just feel bad about this "my intuition is that you'll need more layers than 10 in the NN, and much much more data."
I still see current AI as basically a filter: image to text, speech to text, etc.
A smart AI would be able to figure out sudoku's rules from a very small sample of games, then figure out a rudimentary backtracking algorithm. It would then sleep on it and figure out more patterns like naked singles, etc., and build a really fast sudoku solver with 100% accuracy. All this happening in a matter of seconds.
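For scale, the "rudimentary backtracking algorithm" part is tiny when written by hand — a sketch of plain depth-first search over a flat list of 81 ints (0 = blank), with no human strategies at all:

```python
def solve(grid):
    """Plain backtracking over a 9x9 grid given as a flat list of 81
    ints (0 = blank). Mutates grid in place; returns True iff solved."""
    try:
        i = grid.index(0)
    except ValueError:
        return True  # no blanks left: solved
    row, col = divmod(i, 9)
    box = (row // 3) * 27 + (col // 3) * 3  # index of the box's top-left cell
    used = set(grid[row * 9:row * 9 + 9])   # digits already in this row
    used |= set(grid[col::9])               # ... and this column
    used |= {grid[box + r * 9 + c] for r in range(3) for c in range(3)}  # ... and box
    for d in range(1, 10):
        if d not in used:
            grid[i] = d
            if solve(grid):
                return True
    grid[i] = 0  # undo and backtrack
    return False
```

Which is part of the point: a dozen lines of search already gets 100% accuracy, so the NN is only interesting as a learning exercise, not as a solver.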
Pigs dance all the time for suitable definitions of dance and time.
I don't think optimizing a rat's nest of functions with SGD is that impressive. If a model lacks explanatory power, what use is it to people?
In this case, when the solution from the NN is incorrect, what recourse do you have? The actual solution might be an arbitrary permutation of what the NN gave you, and there is no way to tell which rows and columns would have to be reshuffled to get the actual solution, or even whether a solution exists. The constraints might be inconsistent and you would never know.
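One saving grace: verifying a candidate is cheap even though the NN's output can't be trusted. A sketch of a checker, assuming the same 81-char string encoding ('0' for blank) as above — it tells you *that* the net failed, though, as you say, not how to repair the answer:

```python
def valid_solution(puzzle, candidate):
    """Check an 81-char candidate grid against the sudoku constraints
    and against the original clues. Returns True iff candidate is a
    complete, consistent solution that agrees with every given clue."""
    digits = set('123456789')
    # must agree with every given (non-'0') clue
    if any(p != '0' and p != c for p, c in zip(puzzle, candidate)):
        return False
    rows = [candidate[r * 9:(r + 1) * 9] for r in range(9)]
    cols = [candidate[c::9] for c in range(9)]
    boxes = [''.join(rows[br * 3 + r][bc * 3:bc * 3 + 3] for r in range(3))
             for br in range(3) for bc in range(3)]
    # every row, column, and box must contain each digit exactly once
    return all(set(unit) == digits for unit in rows + cols + boxes)
```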
My immediate thought was to generate training data by making up solved puzzles, and then removing numbers one at a time (while ensuring there's still a unique solution), with a few different branches at each step from an initial solution.
You could generate massive datasets that way -- it would be pretty easy to generate a few billion pre-images of a solution. I mean, a solution has 81 numbers and a puzzle about 20. 81 choose 20 is on the order of 10^18, or a billion billion.
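A sketch of that generation loop, assuming grids as flat lists of 81 ints (0 = blank); the uniqueness check is a backtracking solution counter capped at two, since all we care about is "exactly one":

```python
import random

def count_solutions(grid, limit=2):
    """Count solutions by backtracking, stopping once `limit` is
    reached (we only need to know whether there is exactly one)."""
    try:
        i = grid.index(0)
    except ValueError:
        return 1  # grid is full: one solution found
    row, col = divmod(i, 9)
    box = (row // 3) * 27 + (col // 3) * 3
    used = set(grid[row * 9:row * 9 + 9]) | set(grid[col::9])
    used |= {grid[box + r * 9 + c] for r in range(3) for c in range(3)}
    n = 0
    for d in range(1, 10):
        if d not in used:
            grid[i] = d
            n += count_solutions(grid, limit - n)
            if n >= limit:
                break
    grid[i] = 0  # restore before returning
    return n

def make_puzzle(solution, clues=30, rng=random):
    """Strip cells from a solved grid one at a time, skipping any
    removal that would let a second solution appear. `clues` is the
    target number of givens (30 keeps the search fast; 17 is the
    known minimum but far slower to reach)."""
    grid = list(solution)
    cells = list(range(81))
    rng.shuffle(cells)
    for i in cells:
        if sum(1 for v in grid if v != 0) <= clues:
            break
        saved, grid[i] = grid[i], 0
        if count_solutions(grid) != 1:
            grid[i] = saved  # removal broke uniqueness; put it back
    return grid
```

Different `rng` seeds give different removal orders from the same solution, which is the branching you describe; every puzzle produced is unique-solution by construction.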