Shirt Reader (#140)

Sun Microsystems handed out some fun shirts at the LSRC. The graphic on the front looks something like:

sun image

star image + T + up arrow

E + perfume bottle image + sea shells image

This is a clever advertisement for their "Sun Startup Essentials" program.

This week's Ruby Quiz is to write a program that can read such shirts. At the basic level, we will assume the user will feed us the right words:

$ ruby read_shirt.rb e scent shells
essentially

Hopefully your solution will be smarter than mine, which, as you can see, only gets close. It may be better to give a few possible choices instead of just the best match.

Of course, the problem comes when you guess the wrong word. My first thought was of a perfume bottle, as my description shows, but the correct word was scent. For bonus points, see what you can do about this problem. One idea might be to allow alternatives for a segment of the word. Another might be to hit the thesaurus, which does list scent as a synonym for perfume.


Quiz Summary

This quiz is surprisingly challenging. Our brain is capable of making some pretty complex associations between sounds to make sense of these puzzles quite quickly. To code that up is a non-trivial task.

All of the solutions used some variant of the Metaphone algorithm to match words with similar sounds. The solution from steve d is probably the most complete. It pulls from a pronunciation dictionary, uses the afore mentioned Metaphone algorithm, refines matches using a Levenshtein edit distance algorithm, and even inlines some C for speed. I do recommend everyone have a look at that code.

I'm going to go through one of Eugene Kalenkovich's solutions in this summary though. It gets pretty good answers in the spot checking I did, it's performance isn't too bad, and it's a bit smaller than steve's code. It starts by pulling in some external resources:

ruby
require 'rubygems'
require 'text'
include Text::Metaphone
include Text::Levenshtein
load 'expectations.rb'

# ...

Most of this code is about loading the Text gem. For those not familiar with it, Text includes a slew of algorithms useful in processing text. This code will use the Metaphone and the improved Double Metaphone algorithms to compare words by sound. It also uses the Levenshtein distance algorithm to see how far apart two possible words are.

The expectations.rb file just fills a global Hash with some test cases.

Another thing most of the solutions did was to perform some translation of the provided words to make them more likely to match sounds. One particular problem point seemed to be "ee" sounds at the end of words, typically ending in y. Performing some phonetic substitutions can increase the matching accuracy in these cases. Here's Eugene's code for that:

ruby
# ...

subs={'1'=>'wan','2'=>'to','3'=>'tre','4'=>'for','5'=>'five','6'=>'six',
'7'=>'seven','8'=>'ate','9'=>'nine','10'=>'ten',
'c'=>'see','h'=>'eich','j'=>'jey','k'=>'key','q'=>'que','r'=>'ar'}
subsy={}
%w[b c d g p t v z].each {|l| subsy[l]=l+'y'}
%w[b c d g p t v z].each {|l| subs[l]=l+'ee'}
%w[f l m n s x].each{|l| subs[l]='e'+l}

# ...

Here we see Hashes constructed to provide the matching of numbers and improve the matching of certain letters. Note that these are just used to match standalone chunks of the input. Substitutions will not be made inside words.

Next we have some wrappers over the algorithms provided in the Text gem:

ruby
# ...

def metadist(str1,str2)
2*distance(metaphone(str1),metaphone(str2))+
distance(str1,str2)
end

def short_double_metaphone(word)
m1,m2=double_metaphone(word)
[m1[0,2],m2 ? m2[0,2] : nil]
end

# ...

The first method is an enhanced distance function for checking how similar two words are. It works by applying the sound algorithm to each words then building and edit distance between those and adding that to the normal edit distance for the words. The sound distance is weighted double in the overall result.

The second wrapper is for building a word index using a Double Metaphone algorithm. It returns shortened versions of the provides sounds for use as Hash keys. The code that builds that Hash is what we want to look at next:

ruby
# ...

hash=Hash.new{|h,k|h[k]=[]}

File.open("/usr/share/dict/words") {|f| f.readlines}.each do |w|
word=w.downcase.delete("^a-z")
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
end
$expectations.values.each { |word|
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
}

hash.each_key{|k| hash[k].uniq!}

# ...

This code just loops over the built-in dictionary (on Unix), building sound keys for each word and populating the Hash. Note that the File loop can be simplified to File.foreach(...) { |w| ... }.

One point of interest here is that the code loops over test cases adding the solution words to the dictionary. This raises a good point: the quality of the dictionary will have a big impact on your results. The built-in dictionary is convenient to test with, but you probably want to trade up to a better word list when quality really counts.

The input handling code is trivial:

ruby
# ...

inputs=[]
if (ARGV.empty?)
inputs=$expectations.keys
else
inputs << ARGV
end

# ...

This just supports the quiz interface, or defaults to running the included test cases.

All we have left is the application code:

ruby
inputs.each { |rebus|
y_ed=rebus[0..-2]<<(subsy[rebus[-1]] || rebus[-1])
word=y_ed.map{|w| subs[w] || w }.join.downcase.gsub(/[^a-z0-9]/,'')
m1,m2=short_double_metaphone(word)
results=hash[m1]
results+=hash[m2] if m2 && m2!=m1
res=results.uniq.sort_by{|a| [metadist(word,a),a.length]}.first(5)
print "'#{rebus.join(' ')}' => #{res[0]}"
expected=$expectations[rebus]
print ", expected '#{expected}' is at position #{res.index(expected)}" \
if expected
puts
}

The process here isn't too hard to follow. The inputs are combined into a list, with the last element undergoing the "ee" sound substitutions we discussed earlier. After that, all elements go through the general letter and number substitutions and get combined into a word.

The sound Hash keys are pulled for the resulting word and used to lookup a result set of possible matches. That result set is then ordered by the enhanced distance function and and pruned to the top five matches.

The rest of the code just handles the printing.

My thanks to all who managed to get surprisingly accurate results for a tough problem. As always, you've taught me new tricks.

It's pretty likely that tomorrow's problem will center around probability...