Text Munger (#76)

by Matthew Moss

Now terhe is a fnial rseaon I thnik that Jsues syas, "Lvoe your
emneies." It is tihs: taht love has wtiihn it a remvpidtee pewor. And
three is a pwoer three taht eellvtanuy tfranrmsos idvlinaidus. Taht's
why Juess syas, "Love your emeeins." Bsceaue if you hate your
enmeies, you have no way to reedem and to tarfnrsom your eenmeis. But
if you love yuor emienes, you wlil decsiovr that at the vrey root of
love is the pwoer of rdoemptein. You just keep loinvg pepole and keep
lnivog tehm, even tgouhh they're mteitnsiarg you. Hree's the porsen
who is a nhoeigbr, and tihs psoren is dnoig simhoetng wrong to you and
all of that. Just keep being fnrdliey to that preosn. Keep liovng
them. Don't do atnynhig to earsmrbas tehm. Just keep lvonig them, and
they can't stand it too long. Oh, they raect in mnay ways in the
bineningg. They react wtih brnetitess beucase they're mad bauesce you
lvoe them like that. Tehy raect wtih gluit flegines, and setioemms
they'll hate you a lltite more at that tinoiasrtn piroed, but just
keep lvniog them. And by the poewr of your love tehy will beark down
uendr the load. That's lvoe, you see. It is retpmevide, and tihs is
why Juess says love. Trehe's shimeotng aubot love that blidus up and
is cavrtiee. Trehe is stmeonihg aubot hate that tares dwon and is
disettvrcue. So lvoe your eenmeis.

At first glance, the above may appear to be gibberish, but you may find that you can actually read this portion of a speech from Dr. Martin Luther King Jr. The brain has an amazing capacity to compensate for things that aren't quite right, and one study has shown that when the first and last letters of words are left alone but those in the middle are scrambled, the text is often still quite comprehensible.

Your task for this quiz, then, is to take a text as input and output the text in this fashion. Scramble each word's center (leaving the first and last letters of each word intact). Whitespace, punctuation, numbers -- anything that isn't a word -- should also remain unchanged.


Quiz Summary

This is not an overly difficult problem. Here's a small but easy-to-follow solution by Gordon Thiesfeld:

ruby
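class String

  # Split on word boundaries, munge every piece, and rejoin.
  def munge
    split(/\b/).munge_each.join
  end

end

class Array

  # Each element is one piece of the text; break it into characters first.
  def munge_each
    map { |word| word.split(//).munge_word }
  end

  # shift and pop remove the first and last characters, so scramble
  # only sees what is left in the middle.
  def munge_word
    first, last, middle = shift, pop, scramble
    # Interpolating the middle Array joins its elements (Ruby 1.8
    # behavior; later versions would need middle.join).
    "#{first}#{middle}#{last}"
  end

  def scramble
    sort_by { rand }
  end

end

if __FILE__ == $PROGRAM_NAME

  begin
    puts File.open(ARGV[0], 'r').read.munge
  rescue
    puts "Usage: text_munge.rb file"
  end

end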

The flow here is simple: bust up the document into words, munge all words, and stitch it back together. Munging a word is just separating it into characters and rearranging everything but the first and last character.
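For readers on a current Ruby, the same split, munge, and join flow can be sketched as follows. This is not Gordon's code: munge_token and munge_text are hypothetical names, and the optional rng parameter is an assumption added only so runs can be made repeatable.

```ruby
# A sketch of the split / munge / join flow for current Ruby versions.
# munge_token and munge_text are hypothetical names, not part of any solution.

# Shuffle the interior of one token, keeping its first and last characters.
def munge_token(token, rng = Random.new)
  chars = token.chars
  return token if chars.size < 4   # fewer than two interior characters: nothing can move
  first = chars.shift
  last  = chars.pop
  first + chars.shuffle(random: rng).join + last
end

# Break the text on word boundaries, munge every token, and stitch it back together.
def munge_text(text, rng = Random.new)
  text.split(/\b/).map { |token| munge_token(token, rng) }.join
end

puts munge_text("Here is a simple sentence, for testin' scripts.")
```

Like the original, this sketch munges every token the split produces, not only the runs of letters.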

Probably the trickiest line in the whole deal is the first and only line in munge(). It breaks the passed document on word boundaries, which will be every place a word begins and ends. Thus, given the sentence:

Here is a simple sentence, for testin' scripts.

Gordon's code will break the document into this Array:

[ "Here", " ", "is", " ", "a", " ", "simple", " ", "sentence", ", ",
"for", " ", "testin", "' ", "scripts", ".\n" ]

It's important to remember that this is the Regular Expression definition of "words" (runs of \w characters), which includes digit characters and the underscore. That's not a perfect match for the quiz task, but it was a popular choice nonetheless.
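A quick way to see what that definition of "words" lets through is to split a line that contains digits and an underscore. The tokens helper below is a hypothetical name used only for this illustration.

```ruby
# What split(/\b/) treats as a "word": any run of \w characters,
# which covers letters, digits, and the underscore.
# tokens is a hypothetical helper for this illustration.
def tokens(text)
  text.split(/\b/)
end

# "Chapter_2" and "1024" each come back as a single word-like token.
p tokens("Chapter_2 has 1024 words, allegedly.")
```

Because "1024" is four characters long, a solution built on this split would scramble its middle digits, even though the quiz asks for numbers to be left alone.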

Now, I did say *all* words are scrambled and that is what I meant. Every element of that Array goes through munge_word(), so a run of four or more punctuation or whitespace characters is treated as a word, and its middle characters would be scrambled. In practice, this is rare enough to be a minor issue.
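If that edge case matters, one possible guard is to scramble only the tokens made up entirely of word characters. This is a sketch for a current Ruby, not something taken from the submitted solutions; munge_words_only and its rng parameter are hypothetical.

```ruby
# Scramble only tokens that consist entirely of word characters.
# munge_words_only is a hypothetical name; rng is optional and only
# there to make runs repeatable.
def munge_words_only(text, rng = Random.new)
  text.split(/\b/).map do |token|
    # Runs of spaces or punctuation, and words too short to scramble, pass through.
    next token unless token =~ /\A\w{4,}\z/
    token[0] + token[1..-2].chars.shuffle(random: rng).join + token[-1]
  end.join
end

puts munge_words_only('Wait... !?#$ really')
```

The guard still uses \w, so digits and underscores count as word characters here too.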

I made a bit of a fuss about multi-byte characters during the discussion, which some people did try to satisfy. It's only fair I add detail here.

There are many multi-byte character encodings, but I will focus on just the UTF8 encoding, because I am way out of my league with anything else. If you are unfamiliar with Unicode encodings, this article is a pretty good general introduction:

A Unicode Introduction

The Ruby specifics are harder to come by, sadly.

Basically, Ruby's Unicode support (UTF8 encoding only) comes through regular expressions (using matches or methods like split()). These can be made character aware (instead of byte aware) by properly setting $KCODE. Here's an example:

$ cat byte_string.rb
#!/usr/local/bin/ruby -w

"résumé".split("").each { |chr| p chr }
$ ruby byte_string.rb
"r"
"\303"
"\251"
"s"
"u"
"m"
"\303"
"\251"
$ cat utf8_string.rb
#!/usr/local/bin/ruby -w

$KCODE = "UTF8"

"résumé".split("").each { |chr| p chr }
$ ruby utf8_string.rb
"r"
"é"
"s"
"u"
"m"
"é"

Notice that when I didn't set $KCODE, each two-byte letter is split into its separate bytes. However, when I tell Ruby to be Unicode aware, the bytes of each letter stay together.
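From Ruby 1.9 on, $KCODE no longer has any effect, because every String carries its own encoding. The same byte versus character distinction can still be seen on a current Ruby with a small sketch like this one, where WORD is just an illustrative constant:

```ruby
# "résumé", written with \u escapes so the source file's encoding does not matter.
WORD = "r\u00E9sum\u00E9"

p WORD.bytes.size         # 8: each é occupies two bytes in UTF-8
p WORD.chars.size         # 6: String#chars respects the string's encoding
p WORD.b.split("").size   # 8: a binary copy splits into single bytes
p WORD.split("").size     # 6: the UTF-8 string splits into characters
```

String#b returns a copy tagged as binary, which is roughly what the first script above was working with.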

That should give you enough background to tell the solutions that can handle multi-byte characters from the ones that can't, giving you more examples to look at. Here's a multi-byte aware solution from Ross Bamford (-Ku is a shortcut for $KCODE = "UTF8"):

ruby
#!/usr/local/bin/ruby -Ku
$stdout << ARGF.read.gsub(/\B((?![\d_])\w{2,})\B/) do |w|
  $&.split(//).sort_by { rand }
end

That's mainly just a more compact version of Gordon's script. This time though, we are interested in the results of running it. Watch the é hop around as I run it a few times:

$ ruby Ross\ Bamford/scramble.rb test_document.txt
Actheatd is my rsuémé.
$ ruby Ross\ Bamford/scramble.rb test_document.txt
Aaectthd is my rmséué.
$ ruby Ross\ Bamford/scramble.rb test_document.txt
Aatcethd is my rémsué.

Gordon's solution is not multi-byte aware out of the box. Watch how things change with that:

$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Achttead is my résumé.
$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Aehttacd is my résumé.
$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Aheatctd is my résum?.?

In order to make sense of that, you need to see how the code found the words in that line:

["Attached", " ", "is", " ", "my", " ", "r", "\303\251", "sum", "\303\251.\n"]

See how the last é is lumped in with the end punctuation? That makes the group of characters long enough to scramble. Once the two bytes of the é are separated, they no longer form a valid character, and my terminal doesn't know how to display them.
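The effect can be reproduced on a current Ruby by building that last token from its raw bytes and swapping the two in the middle, which is one of the orderings the scramble can produce. This is only an illustration; TOKEN_BYTES and utf8_from are hypothetical names.

```ruby
# The last token Gordon's script saw: the two bytes of é, a period, and a newline.
TOKEN_BYTES = [0xC3, 0xA9, 0x2E, 0x0A]

# Build a String from raw bytes and label it as UTF-8.
def utf8_from(bytes)
  bytes.pack("C*").force_encoding("UTF-8")
end

intact    = utf8_from(TOKEN_BYTES)
scrambled = utf8_from([0xC3, 0x2E, 0xA9, 0x0A])   # the two middle bytes swapped

p intact.valid_encoding?      # true
p scrambled.valid_encoding?   # false: 0xC3 is no longer followed by its continuation byte
```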

The good news is, we can magically fix Gordon's script:

$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Aatcehtd is my réumsé.
$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Athetcad is my rmuésé.
$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Atcthead is my rmséué.

We probably can't fix all the solutions like this though. It depends on how they separated the word into letters.

The downside of multi-byte support is that it gets harder to recognize just the letters among the word characters, leaving out the digits and underscores. Filtering out punctuation is a lot harder when we expand to such a vast definition of characters. I'm not aware of a good Ruby solution for that issue yet. (Please enlighten me if you are!)
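Later Ruby versions (1.9 and up) added Unicode property classes to regular expressions, so one possible answer, offered as a sketch rather than a definitive fix, is to match runs of \p{L} (any Unicode letter). The munge_letters name and its rng parameter are hypothetical.

```ruby
# Scramble only runs of Unicode letters (\p{L}); digits, underscores,
# punctuation, and whitespace never match and pass through unchanged.
# munge_letters is a hypothetical name; rng only makes runs repeatable.
def munge_letters(text, rng = Random.new)
  text.gsub(/\p{L}{4,}/) do |word|
    word[0] + word[1..-2].chars.shuffle(random: rng).join + word[-1]
  end
end

puts munge_letters("Attached is my r\u00E9sum\u00E9, version 2_0.")
```

This assumes precomposed characters. A letter followed by a separate combining accent would end a \p{L} run, so such text would need extra care.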

My thanks to Matthew for another great quiz and to all who gave it a shot.

Tomorrow we will build a simple tool for those of you showing off your code in an IRC channel...