Learning Tic-Tac-Toe (#11)

This week's Ruby Quiz is to implement an AI for playing Tic-Tac-Toe, with a catch: You're not allowed to embed any knowledge of the game into your creation beyond how to make legal moves and how to recognize that it has won or lost.

Your program is expected to "learn" from the games it plays, until it masters the game and can play flawlessly.

Submissions can have any interface, but should be able to play against humans interactively. However, I also suggest making it easy to play against another AI, so you can "teach" the program faster.

Being able to monitor the learning progression and know when a program has mastered the game would be very interesting, if you can manage it.


Quiz Summary

In an interesting contrast, this quiz generated a lot of good discussion, but only two solutions. I believe that may be because the problem turned out to be more complicated than I intended. I know I personally ran into a few complications and didn't have a chance to finish my own solution.

Again, the discussion was excellent and you should probably skim the thread, if you weren't following it this weekend. Multiple Tic-Tac-Toe servers (written in Ruby, of course) were posted, so programs could play against each other remotely.

I posted a message about how Tic-Tac-Toe positions can be transformed by rotation and "mirroring" into other seemingly different layouts that can be handled the same way.
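If you want to experiment with that idea, here is a minimal sketch of one approach (my own illustration, not taken from any posted solution): generate the eight symmetries of a board and use the smallest as a canonical key, so rotated and mirrored positions share a single entry in whatever lookup table you build.

# Generate the eight symmetries of a 3x3 board (here a flat Array of 9
# cells, an assumption made purely for illustration) and pick a canonical
# representative for the whole family of equivalent positions.
def symmetries(board)
  rotate = lambda { |b| [b[6], b[3], b[0], b[7], b[4], b[1], b[8], b[5], b[2]] }
  mirror = lambda { |b| [b[2], b[1], b[0], b[5], b[4], b[3], b[8], b[7], b[6]] }
  rotations = [board]
  3.times { rotations << rotate[rotations.last] }
  rotations + rotations.map { |b| mirror[b] }
end

def canonical(board)
  symmetries(board).map { |b| b.join }.min
end

canonical(%w[X . . . O . . . .])  # rotated or mirrored variants of this position map to the same key

With a trick like this, a learning program only needs to remember one entry per family of equivalent positions, which cuts down the number of games needed for training.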

I also posted a "tictactoe.rb" library that makes most interaction with the game trivially simple.

Finally, a few of us posted some notes about the problems we ran into. I agree with Hans Fugal, who said that you can learn almost as much from those.

The two solutions posted are similar. Basically, they learn to avoid their mistakes over time. They accomplish this by "scoring" the moves they made at each position in a game, based on whether they won or lost. Eventually, this knowledge allows them to select mostly strong moves, simply by remembering how they have fared from the same positions in the past.

I'll show Brian Schröder's code here, but both were interesting to examine. Brian's solution contained eight files of Ruby code, embedded documentation for those files, charts, and a write-up of the process, all wrapped up in a handy web page, so you can dig as deep as you like into what he's done. (And he once asked where I find the time!) For the purposes of this summary, I'll stick to his learning code.

Here it is:

class Learning < BasicInterface
  attr_accessor :random_prob
  attr_reader :player

  def initialize
    @state_values = Hash.new(0)
    @state_transitions = {}
    @random_prob = 0.05
  end

  def new_game(player)
    @player = player
    @states_visited = []
  end

  def choose_move(game)
    moves = game.moves
    if !@state_transitions[game.state_id] or rand < random_prob
      move = moves[rand(moves.length)]
    else
      move_id = @state_transitions[game.state_id].max{ |(ma,sa),(mb,sb)|
        @state_values[sa] <=> @state_values[sb]
      }[0]
      move = moves.select{|m| m.move_id == move_id}[0]
    end
    move
  end

  def inform_of_move(before, after, move)
    @states_visited << before.state_id << after.state_id
    (@state_transitions[before.state_id] ||= {})[move.move_id] = after.state_id

    if after.final?
      winner = after.winner
      if winner
        value = winner == self.player ? 100.0 : -1000.0
      else
        value = 0.0
      end

      factor = 1.0
      while state = @states_visited.pop
        @state_values[state] = (1.0 - factor) * @state_values[state] + factor * value
        factor *= 0.5
      end
    end
  end
end

The initialize() method sets up Brian's @state_values and @state_transitions, which constitute the AI's brain.

@state_values will hold scores for the positions the AI has passed through on its way to past wins and losses. @state_transitions holds a "map" of how to get from position to position. When these are filled in, the AI will have "learned" which positions are desirable and how it can reach them.
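To picture what ends up in those structures, here's a hypothetical snapshot after a few games (the state and move ids are whatever the game library assigns; the numbers below are made up purely for illustration):

# position id => learned score for that position
@state_values = { 1342 => 50.0, 2781 => -500.0 }

# position id => { move id => position id that move leads to }
@state_transitions = { 1342 => { 4 => 2781, 8 => 3077 } }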

Knowing this, choose_move() is easy to break down. It checks whether it knows anything about the moves from the current position. If it does, it follows the move leading to the highest-scored position it knows about (the else branch). If it doesn't, it goes with a random choice from all available moves (the if branch). Note that even when it does have knowledge, it still makes a random move 5% of the time (random_prob), which keeps it exploring lines it hasn't tried yet.
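That "follow the best-scored move" step is the only tricky line. Here's the same max call pulled out on its own, with hypothetical ids and scores, to show how it picks the move id whose destination position has the best learned value:

# transitions from the current position: move id => resulting position id
transitions = { 4 => 2781, 8 => 3077 }
# learned values for those positions (unknown positions default to 0)
values = Hash.new(0)
values[2781] = -500.0
values[3077] = 75.0

best = transitions.max { |(_, sa), (_, sb)| values[sa] <=> values[sb] }[0]
# => 8, the move leading to the highest-valued known position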

The final piece of the puzzle is inform_of_move(). This method mainly remembers the positions visited during the game. When it sees a final position, it scores all of those positions based on whether the AI won or lost, with positions nearer the end of the game taking more of the credit or blame (the factor is halved at each step back). The scoring scale is slanted (-1000.0 for a loss versus 100.0 for a win), to encourage the AI to avoid losses.
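To see the scoring scale and the halving factor in action, here's a toy trace of that last loop, assuming a brand-new brain (all old scores are 0.0), a win worth 100.0, and made-up state ids:

# positions visited during the game, earliest first
visited = [:s1, :s2, :s3, :s4]
scores  = Hash.new(0.0)
value   = 100.0                       # the AI just won

factor = 1.0
while state = visited.pop
  scores[state] = (1.0 - factor) * scores[state] + factor * value
  factor *= 0.5
end

scores  # => {:s4=>100.0, :s3=>50.0, :s2=>25.0, :s1=>12.5}

The final position takes the full credit or blame, and each earlier position gets half as much as the one after it, so early moves are only gently nudged by any single game.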

That's the heart of Brian's solution. The rest of the code is interface, client/server, a perfect minimax player, and Tic-Tac-Toe details.

For the curious, this quiz was inspired by the research of Donald Michie. In 1961 he built a "machine" that learned to play perfect Tic-Tac-Toe against humans, using matchboxes and beads. He called the machine MENACE (Matchbox Educable Noughts And Crosses Engine).

304 matchboxes were labeled with images of Tic-Tac-Toe positions and filled with colored beads representing the possible moves. At each turn, a bead was rattled out of the proper box to determine the machine's move. When MENACE won, more beads of the colors played were added to each position's box. When it lost, those beads were left out, discouraging the losing moves.

Michie claimed that he trained MENACE in 220 games, being beaten by it in eight of the final ten games.

Thanks to those who tried this one, successful or not. Sorry it turned out to be a bit involved.

Tomorrow, "How Ruby Can Help You Beat Your Grandmother In Scrabble" is the topic. If your grandmother is anything like mine, you'll appreciate all the help you can get!