TumbleDRYer (#53)

by Hugh Sasse

The Pragmatic Programmers (and others) recommend that you remove repetition from code. They call this principle DRY: Don't Repeat Yourself. Benefits include not having to track down multiple places to make a change affecting the whole system, prevention of inconsistency, as well as reduced debugging time. Sometimes code ends up being repetitive. Take this MySQL example from:

Rails Cookbook

CREATE TABLE `authors` (
`id` int(11) NOT NULL auto_increment,
`firstname` varchar(50) NOT NULL default '',
`name` varchar(50) NOT NULL default '',
`nickname` varchar(50) NOT NULL default '',
`contact` varchar(50) NOT NULL default '',
`password` varchar(50) NOT NULL default '',
`description` text NOT NULL,
PRIMARY KEY (`id`)
) TYPE=MyISAM AUTO_INCREMENT=3 ;
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(20) NOT NULL default '',
`description` varchar(70) NOT NULL default '',
PRIMARY KEY (`id`)
) TYPE=MyISAM AUTO_INCREMENT=3 ;
CREATE TABLE `categories_documents` (
`category_id` int(11) NOT NULL default '0',
`document_id` int(11) NOT NULL default '0',
) TYPE=MyISAM ;
CREATE TABLE `documents` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(50) NOT NULL default '',
`description` text NOT NULL,
`author_id` int(11) NOT NULL default '0',
`date` date NOT NULL default '0000-00-00',
`filename` varchar(50) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `document` (`title`)
) TYPE=MyISAM AUTO_INCREMENT=14 ;

clearly there's a lot of repetition in there. It would be nice if it were possible to eliminate it. This quiz is about writing a program which we'll call TumbleDRYer, which, given repetitive input, will create ruby code to generate that input, such that the generating program has much less repetition in it. One way might be to generate a program like this:


# Undoes the work of TumbleDRYer :-)
class WashingMachine
def initialize
@create = 'CREATE TABLE'
@type = 'TYPE=MyISAM'
@varchar_50 = "varchar(50) NOT NULL default '',"
@varchar_20 = "varchar(20) NOT NULL default '',"
@int_11 = "int(11) NOT NULL auto_increment,"
@int_11_0 = "int(11) NOT NULL default '0',"
@pk = "PRIMARY KEY (`id`)"
@auto = "AUTO_INCREMENT="
end

def output
print <<EOF
#{@create} `authors` (
`id` #{@int_11}
`firstname` #{@varchar_50}
`name` #{@varchar_50}
`nickname` #{@varchar_50}
`contact` #{@varchar_50}
`password` #{@varchar_50}
`description` text NOT NULL,
#{@pk}
) #{@type} #{@auto}3 ;
#{@create} `categories` (
`id` #{@int_11}
`name` #{@varchar_20}
`description` varchar(70) NOT NULL default '',
#{@pk}
) #{@type} #{@auto}3 ;
#{@create} `categories_documents` (
`category_id` #{@int_11_0}
`document_id` #{@int_11_0}
) #{@type} ;
#{@create} `documents` (
`id` #{@int_11}
`title` #{@varchar_50}
`description` text NOT NULL,
`author_id` #{@int_11_0}
`date` date NOT NULL default '0000-00-00',
`filename` #{@varchar_50}
#{@pk},
KEY `document` (`title`)
) #{@type} #{@auto}14 ; EOF
end
end

WashingMachine.new.output

or something of the kind. Or it might be worth using ERB.

Obviously the more human readable to the resulting program is the better: the point is to produce something that simplifies maintenance, rather than data compression. Techniques from data compression would seem to be useful, however. Building up a dictionary of character groups and their frequencies might be useful, and various strategies are possible for how to fragment the string into repeated chunks, such as: the longest repeated substring; splitting on some kind of boundary; or techniques from Ziv-Lempel coding to create the groups.


Quiz Summary

Great quiz Hugh! This is one of those unique ideas that was a lot of fun to play with. There's always some neat things to be accomplished with code that writes code.

I want to take a look at Bob Showalter's solution below. It's really just one linear procedure, so I'll show all the code and then break it down:

ruby
require 'strscan'
require 'abbrev'

class TumbleDRYer

# minimum length of a phrase to consider
MIN_PHRASE = 10

# minimum times a phrase must occur to consider
MIN_OCCUR = 3

# minimum length for abbreviation
MIN_ABBR = 2

def initialize(string)
@input = string
end

def dry

# this will accumulate a list of repeated phrases to condense
phrases = Array.new

# this will receive the abbreviation for each phrase
abbr = Hash.new

lines = @input.to_a
loop do

# process the input data by lines. we find "phrases" by
# first finding the start and end of each "word" in the line,
# and then combining those words into longer phrases. for
# each phrase, we count the number of times it occurs in the
# total input.
phr = Hash.new
lines.each do |line|
s = StringScanner.new(line)
words = Array.new
loop do
s.scan_until(/(?=\S)/) or break
beg = s.pos
s.scan(/\S+/)
words << [ beg, s.pos ]
end

# combine words to make 'phrases'
combos(words)

# accumulate phrases, counting their occurences.
# skip phrases that are too short.
words.each do |from, to|
p = line[from, to - from]
next unless p.length >= MIN_PHRASE
phr[p] ||= 0
phr[p] += 1
end
end

# get the longest phrase that occurs the most times
longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
}.find { |k,v| v >= MIN_OCCUR } or break
phrase, occurs = longest

# save the phrase, and then blank it out of the input data
# so we can search for more phrases
phrases << phrase
lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

end

# now we have all the phrases we want to replace.
# find unique abbreviations for each phrase.
temp = Hash.new
phrases.each do |phrase|
key = phrase.scan(/\w+/).flatten.to_s.downcase
key = '_' + key unless key =~ /^[_a-zA-Z]/
key += '_' while temp.has_key? key
temp[key] = phrase
end
temp.keys.abbrev.sort.each do |s, key|
phrase = temp[key]
abbr[phrase] = s if abbr[phrase].nil? ||
abbr[phrase].length < MIN_ABBR
end

# generate the output class
puts "class WashingMachine"
puts " def initialize"
phrases.each do |phrase|
puts ' @' + abbr[phrase] + " = '" +
phrase.gsub("'", "\\\\'") + "'"
@input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
end
puts " end\n"
puts " def output\nprint <<EOF"
puts @input
puts "EOF\n end\n"
puts "end"

end

private

def combos(arr, max = arr.size - 1, i = 0)
(i+1..max).each do |j|
arr << [ arr[i][0], arr[j][1] ]
end
combos(arr, max, i + 1) if i < max - 1
end

end

TumbleDRYer.new(ARGF.read).dry

That's the code, save one unused line I removed.

The first thing I do when reading code is to try and understand the process. To get the lay of the land, we need to know what's called where. The path of execution is trivial to find here. The last line is the only code, besides a couple of require statements, outside of a class definition. We know that's what Ruby will run.

That line tells us that we need to be examining two methods TumbleDRYer#initialize and TumbleDRYer#dry. Don't be shy with that scrollbar now. Zip around until you find them. TumbleDRYer#initialize is easy enough, all it does is record the passed input. TumbleDRYer#dry is a monster, so let's come back to it.

Anything else in this file? Well, we can see that two standard libraries are used, strscan and abbrev. We can also see three constants at the top of the class that start to give us ideas about how the code will break down the sections it is meant to simplify.

Beyond that, there's only one other method, which we can assume is a helper to TumbleDRYer#dry. It's smaller, so let's see if we can figure it out first. Uh oh, recursion! My brain just melted and leaked out of my ear. Okay, let's see if we can cheat and just call it. Copy and paste and irb to the rescue:

>> def combos(arr, max = arr.size - 1, i = 0)
>> (i+1..max).each do |j|
?> arr << [ arr[i][0], arr[j][1] ]
>> end
>> combos(arr, max, i + 1) if i < max - 1
>> end
=> nil
>> combos( (1..10).to_a )
=> nil

Or not. I was hoping for a more informative return value obviously.

Okay, it looks like I actually need to read a little. I see that `arr` is a parameter of the method. That's how I decided on a sample data set. A little farther in I can see `arr << ...`. Bingo. It modifies the array. Now I know how to check it:

>> test_data = (1..10).to_a
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>> combos test_data
=> nil
>> test_data
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [1, 1], [1, 1], [1, 0], [1, 0], [1, 1],
[1, 1], [1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1],
[0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [1, 0],
[1, 1], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 1], [1, 1],
[1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 0], [1, 0],
[1, 1], [0, 0], [0, 1], [1, 1]]

Yuck. In the dictionary under "unhelpful", you will find the above listing. More reading is required.

Well, if I finish the line I stopped at last time, I find this little nugget: `[ arr[i][0], arr[j][1] ]`. That makes me think `arr` is expected to be an Array of Arrays. It looks like the subarrays only need two items, since the indexes are 0 and 1. Third try is a charm:

>> test_data = [1, 3, 5, 7, 9].map { |n| [n, n + 1] }
=> [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
>> combos test_data
=> nil
>> test_data
=> [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [1, 4], [1, 6], [1, 8], [1, 10],
[3, 6], [3, 8], [3, 10], [5, 8], [5, 10], [7, 10]]

That looks promising. Now what are we seeing? Obviously it added some groupings to the end of the Array. 1 was already seen with 2, but it also paired it with 4, 6, 8, and 10. That's all of the other second elements. Then 3 is paired with 6, 8, and 10 and it was already paired with 4, which leaves only 2 missing. Now I see the pattern. Each first element is paired with all of the second elements that come after it, in the original Array.

Even knowing how that works, I'm not sure I understand what that is actually for yet. I'll file it away in the back of my mind and see if I'm ready to tackle TumbleDRYer#dry.

Here's the first chunk of that method, to save you some scrolling:

ruby
def dry

# this will accumulate a list of repeated phrases to condense
phrases = Array.new

# this will receive the abbreviation for each phrase
abbr = Hash.new

lines = @input.to_a
loop do

# process the input data by lines. we find "phrases" by
# first finding the start and end of each "word" in the line,
# and then combining those words into longer phrases. for
# each phrase, we count the number of times it occurs in the
# total input.
phr = Hash.new
lines.each do |line|
s = StringScanner.new(line)
words = Array.new
loop do
s.scan_until(/(?=\S)/) or break
beg = s.pos
s.scan(/\S+/)
words << [ beg, s.pos ]
end

# combine words to make 'phrases'
combos(words)

# accumulate phrases, counting their occurrences.
# skip phrases that are too short.
words.each do |from, to|
p = line[from, to - from]
next unless p.length >= MIN_PHRASE
phr[p] ||= 0
phr[p] += 1
end
end

# get the longest phrase that occurs the most times
longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
}.find { |k,v| v >= MIN_OCCUR } or break
phrase, occurs = longest

# save the phrase, and then blank it out of the input data
# so we can search for more phrases
phrases << phrase
lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

end

# ...

Luckily, that's well commented code. The comments easily explain what the variables are for, and then we get an introduction to the process in `loop do ... end`. As soon as I read that, TumbleDRYer#combos clicks into place.

The Array of Arrays passed to TumbleDRYer#combos holds a start and end position for each word. The start positions are combined with all of the end positions after that word to give the possible phrases.

Before the call to TumbleDRYer#combos, we can see the Array of Arrays being created, with a little StringScanner work. Words are found by scanning non-whitespace characters, because replacing at any other boundaries pretty much kills the human readable goal of the output.

Just beneath that call to TumbleDRYer#combos the phrases are assembled, checked for the minimum length, and tallied in the phrase count. The chunk of code right after that selects the longest phrase that occurs the most times for substitution. Note the clever use of negation in #sort_by here, to reverse the order. The last couple of lines save the phrase and replace it with whitespace, so it won't match in future iterations.

Remember, this is all inside of a big `loop do ... end`, so by the time we break out of here, all the replacement phrases will have been selected.

ruby
# ...

# now we have all the phrases we want to replace.
# find unique abbreviations for each phrase.
temp = Hash.new
phrases.each do |phrase|
key = phrase.scan(/\w+/).flatten.to_s.downcase
key = '_' + key unless key =~ /^[_a-zA-Z]/
key += '_' while temp.has_key? key
temp[key] = phrase
end
temp.keys.abbrev.sort.each do |s, key|
phrase = temp[key]
abbr[phrase] = s if abbr[phrase].nil? ||
abbr[phrase].length < MIN_ABBR
end

# ...

This section of the method just finds names for each of the replaced phrases. It generally choses the first few letters or numbers of the phrase, unless it needs to insert a _ character or two to maintain Ruby syntax or break ambiguities. A lot of the work here is done by the abbrev library. If you're not familiar with how that works, ask irb:

>> require "abbrev"
=> true
>> names = %w{James Edward Gray II}
=> ["James", "Edward", "Gray", "II"]
>> names.respond_to? :abbrev
=> true
>> names.abbrev
=> {"II"=>"II", "Gr"=>"Gray", "Gra"=>"Gray", "Jam"=>"James", "Ja"=>"James",
"Edw"=>"Edward", "Edwa"=>"Edward", "Edwar"=>"Edward", "Gray"=>"Gray",
"James"=>"James", "Edward"=>"Edward", "E"=>"Edward", "G"=>"Gray",
"Jame"=>"James", "I"=>"II", "Ed"=>"Edward", "J"=>"James"}

It turns an Array into a Hash of all the unique abbreviations that can be used to form the words in the original Array.

Last little bit of code:

ruby
# ...

# generate the output class
puts "class WashingMachine"
puts " def initialize"
phrases.each do |phrase|
puts ' @' + abbr[phrase] + " = '" +
phrase.gsub("'", "\\\\'") + "'"
@input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
end
puts " end\n"
puts " def output\nprint <<EOF"
puts @input
puts "EOF\n end\n"
puts "end"

end

That just outputs the DRYed code to STDOUT. This output uses a Ruby heredoc with String interpolation. Unfortunately, I think that means that Ruby code, with its own String interpolation, will confuse it. Have a look at Dominik Bathon's code for custom interpolation or Daniel Sheppard's code for ERb templates that even make use of arguments in replacement.

As always my thanks to all the creative souls that gave this quiz a try. All of the solutions are educational.

Ruby Quiz will take a week off now so I can run up to Denver and wish my sister, Nicole Blake Thanner, a happy sweet sixteenth birthday!