Posix Pangrams (#97)

by Martin DeMello

A pangram is a sentence containing every letter of the alphabet at least once (a famous example in English being "the quick brown fox jumps over the lazy dog"). For maximum style points a pangram should read smoothly, and have both as few repeated letters as possible (ideally zero), and as few words as possible.

This quiz extends the idea to the posix utilities - write a program to find pangrammatic collections of posix utilities that (1) use the fewest utilities and (2) have the minimum number of repeated letters. In either case, break ties on the other criterion; that is, your first solution should also have as few repeated letters as possible, and your second one should use as few utilities as possible.


Quiz Summary

This problem turns out to be pretty tough. Finding the best pangram for a given constraint can be NP-Hard, as explained by Ken Bloom:

NP-Hard Reduction

Given that, we're just interested in code that gets reasonably close to ideal results in a decent amount of time. All of the submissions did that quite well, but I want to show off Ezwan Aizat Bin Abdullah Faiz's code below. I've Rubyified it just a bit, but the code still works the same. For reference, it finds this pangram in under a second on my box:

String: zcat jobs newgrp mv qhold tty uux mkfifo
Panagram? true
Words? 8
Length? 33
Dupes? 7

Let's start with the easy stuff:

ruby
#!/usr/bin/ruby

# removed c99 fort77 m4
words = %w[ admin alias ar asa at awk basename batch bc bg cal cat cd
cflow chgrp chmod chown cksum cmp comm command compress cp
crontab csplit ctags cut cxref date dd delta df diff dirname
du echo ed env ex expand expr false fc fg file find fold
fuser gencat get getconf getopts grep hash head iconv id
ipcrm ipcs jobs join kill lex link ln locale localedef logger
logname lp ls mailx make man mesg mkdir mkfifo more mv newgrp
nice nl nm nohup od paste patch pathchk pax pr printf prs ps
pwd qalter qdel qhold qmove qmsg qrerun qrls qselect qsig
qstat qsub read renice rm rmdel rmdir sact sccs sed sh sleep
sort split strings strip stty tabs tail talk tee test time
touch tput tr true tsort tty type ulimit umask unalias uname
uncompress unexpand unget uniq unlink uucp uudecode uuencode
uustat uux val vi wait wc what who write xargs yacc zcat ]

words_line = words.join(" ")

# ...

This code begins by filling an Array with the Posix utilities, except for the three that contain digits in their names. Everyone dropped these to keep things simple. Note that the words are also joined to form one long line of utility names.

Next we need to enhance String a bit with some capabilities we will need to do our work:

ruby
# ...

class String
def letters(&block)
scan(/[a-z]/, &block)
end

def pangram?
letters.uniq.length == 26
end

def duplicated_letters
seen = Hash.new { |found, char| found[char] = 1; 0 }
letters.inject(0) { |sum, l| sum + seen[l] }
end
end

# ...

We will need easy access to an Array of letters (without the spaces), so a simple wrapper is created over scan(). From there we can check pangram status by ensuring that we have 26 unique letters. Finally, when printing statistics it would be nice to be able to see how many letters are duplicates, so we add a method for that too.

Now the code builds up some handy constants in preparation for the search:

ruby
# ...

OCCURRENCE = Hash.new { |counts, char| counts[char] = 0 }
words_line.letters { |char| OCCURRENCE[char] += 1 }

WORDS_LENGTH = words_line.delete(" ").length

# ...

Here we see the OCCURRENCE Hash being filled. It will be the key to the search algorithm. The Hash is simply a count of how many times each letter is present in all of the names. "e" appears the most at a count of 62 times and "z" is present only once.

Another constant is filled with the overall character count.

Here's a method used to print statistics when we have our answer:

ruby
# ...

def stats(line)
puts "String: #{line}"
puts "Pangram? #{line.pangram?}"
puts "Words? #{line.split.length}"
puts "Length? #{line.delete(' ').length}"
puts "Dupes? #{line.duplicated_letters}"
end

# ...

I think that one is pretty self documenting.

We're finally ready for the primary search algorithm and here it comes:

ruby
# ...

=begin
Suitability should be determined by
* least number of letters takes out the length of the computer
* no used letters
* no duplicates
=end

def suitability(value, line)
amount, used = 0, ""

value.letters do |char|
amount += if used.include?(char) || line.include?(char)
-OCCURRENCE[char]
else
WORDS_LENGTH / OCCURRENCE[char]
end
used << char
end

amount
end

# ...

This method grades the passed value for inclusion in the line. Each letter is scored, creating an overall suitability rating for this value.

If the letter is already in the line or the letters we have seen so far from this value, the OCCURRENCE value for that letter is tacked onto the score as a penalty. This ensures the duplicates, especially of common letters, are costly so the code will try to avoid them.

If this is the first time the letter has been encountered, the score is the OCCURRENCE for that letter divided into the total letter count. This makes it valuable for the code to place letters like the "z" early on, since we don't have any choices on that one and thus will also need to accept the letters it comes with.

Here's the final piece of the puzzle, where we see suitability() put to work:

ruby
# ...

line = ""

until line.pangram?
words.sort! do |x, y|
suitability(y, line) <=> suitability(x, line)
end

line += "#{words.shift} "
end

stats line

The process is easy to follow here. First an empty line is created. Then we loop until the line has become a pangram. At each step, the utility list is sorted by suitability() and the best word added to the line. Just before exit, the program dumps the statistics as the result.

My thanks to all you pangram hunters. As always you taught me clever new tricks in optimization.

Tomorrow we'll play with the king of pathfinding algorithms...