Whiteout (#34)

Perl programmers have all the fun. They have an entire section of the CPAN devoted to their playing around. The ACME modules are all fun little toys that have interesting, if rarely useful, effects.

This week's Ruby Quiz is to port ACME::Bleach to Ruby. I won't make you go hunting through the CPAN to figure it out though. Here's how our version will work:

1. Make a Ruby file that is both an executable and a library. We'll call
   it "whiteout".
2. When "whiteout" is executed, it should take ARGV to be a list of Ruby
   source code files to alter in-place. (You may save backup copies if you
   like, but the original file should be changed.) Here are the changes:
   a. A shebang line, if present, is to be passed through the filter
      without any changes.
   b. The script should then add the line: require "whiteout"
   c. The entire rest of the file should be made invisible. You might do
      this by converting the rest of the file to whitespace, as
      ACME::Bleach does.
3. When "whiteout" is required, the original code must be executed with no
   change in functionality.

Let's show those Perl guys that we know how to have a good time too!


Quiz Summary

Does this library have any practical value? Probably not. It's been suggested in the Perl community that hacks like this are a good minor deterrent to those trying to read source code you would rather keep hidden, but it must be stressed that this is no form of serious security. Regardless, it's a fun little toy to play with.

It was mentioned in the discussion that Perl, where ACME::Bleach comes from, includes a framework for source filtering. It can be used to make modules that modify source code much as we are doing in this quiz. Perl's Switch.pm is a good example of this, but ironically ACME::Bleach is not.

That naturally leads to the question: can you build source filters in Ruby? Clearly we can build ACME::Bleach, but not all source filters are as simple, I'm afraid. Consider this:

ruby
#!/usr/local/bin/ruby -w

require "fix_my_broken_syntax"

invalid++

Now the thought here is that fix_my_broken_syntax.rb will read my source, change it so that it does something valid, eval() it, and exit() before the invalid code is an issue. Here's a trivial example of fix_my_broken_syntax.rb:

ruby
#!/usr/local/bin/ruby -w

puts "Fixed!"
exit

Does that work? Unfortunately, no:

$ ruby invalid.rb
invalid.rb:5: syntax error
invalid++
^

Ruby never gets to loading the library, because it's not happy with the syntax of the first file. That makes writing a source filter for anything that isn't valid Ruby syntax complicated, and if it is valid Ruby syntax, you can probably just code it up in Ruby to begin with.
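The ordering problem can be reproduced without separate files. The sketch below is only an illustration (the source string and the ran array are stand-ins, not part of any submitted solution): Ruby has to parse a whole chunk of source before it runs any of it, so even the valid first line never executes.

```ruby
# Stand-in for invalid.rb: one valid line followed by the broken one.
source = <<~RUBY
  ran << :first_line
  invalid++
RUBY

ran = []
error = nil
begin
  # eval must parse the entire string before it can run the first line.
  eval(source)
rescue SyntaxError => e
  error = e
end

puts "parse failed: #{!error.nil?}, lines executed: #{ran.size}"
```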

Except for whiteout.rb, our version of ACME::Bleach.

You can't build Ruby constructs out of whitespace alone, so some form of source filtering is required. Luckily, we can get away with the approach described above for this source filter, because a bunch of whitespace (with no code) is valid Ruby syntax. It just doesn't do anything. Ruby will skip right over our whitespace and load the library that restores and runs the code.
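That claim is easy to check. In this small sketch, blank_program is a made-up stand-in for the invisible part of a whitened file:

```ruby
# Spaces, tabs and newlines only, like the body of a whitened file.
blank_program = " \t  \t\n\t \t \n   \n"

# Evaluating it raises nothing and yields nil, because there is nothing to run.
result = eval(blank_program)
puts result.inspect
```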

Most people took this approach. Let's examine one such example by Robin Stocker:

ruby
#!/usr/bin/ruby

#
# This is my solution for Ruby Quiz #34, Whiteout.
# Author:: Robin Stocker
#

#
# The Whiteout module includes all functionality like:
# - whiten
# - run
# - encode
# - decode
#
module Whiteout

  @@bit_to_code = { '0' => " ", '1' => "\t" }
  @@code_to_bit = @@bit_to_code.invert
  @@chars_to_ignore = [ "\n", "\r" ]

  #
  # Whitens the content of a file specified by _filename_.
  # It leaves the shebang intact, if there is one.
  # At the beginning of the file it inserts the require 'whiteout'.
  # See #encode for details about how the whitening works.
  #
  def Whiteout.whiten( filename )
    code = ''
    File.open( filename, 'r' ) do |file|
      file.each_line do |line|
        if code.empty?
          # Add shebang if there is one.
          shebang = line =~ /\A#!/
          code << line if shebang
          code << "#{$/}require 'whiteout'#{$/}"
          # A first line that is not a shebang is code, so encode it too.
          code << encode( line ) unless shebang
        else
          code << encode( line )
        end
      end
    end
    File.open( filename, 'w' ) do |file|
      file.write( code )
    end
  end

  # ...

First, we can see that the module defines some class variables (the @@ names), which are really used as constants here. Their contents hint at the encoding algorithm we'll see later.

Then we have a method for managing the transformation of the source into whitespace. It starts by opening the passed file and reading the code line-by-line. If the first line is a shebang line, it's saved in the variable code. Next, a "require 'whiteout'" line is added to code. Finally, all other lines from the file are appended to code after being passed through an encode() method we'll examine shortly. With the contents read and transformed, the method then reopens the source for writing and dumps the modifications into it.
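The first-line handling is the delicate part: a shebang must stay visible, while any other first line is ordinary code that still has to reach the encoder. Here is a minimal sketch of that split, working on strings rather than files. The helper name split_header is hypothetical and does not appear in Robin's solution:

```ruby
# Hypothetical helper: builds the visible header and returns it together
# with the remaining source that still needs to be made invisible.
def split_header(source)
  lines = source.lines
  # Only a first line starting with "#!" counts as a shebang.
  shebang = lines.first.to_s.start_with?("#!") ? lines.shift : ""
  [shebang + "require \"whiteout\"\n", lines.join]
end

header, body = split_header("#!/usr/bin/ruby\nputs 1\n")
plain_header, plain_body = split_header("puts 1\nputs 2\n")

puts header
puts plain_body
```

With a shebang present, the header carries it ahead of the require line. Without one, every line of the source ends up in the body.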

The next method is the reverse process:

ruby
  # ...

  #
  # Reads the file _filename_, decodes and runs it through eval.
  #
  def Whiteout.run( filename )
    text = ''
    File.open( filename, 'r' ) do |file|
      decode = false
      file.each_line do |line|
        if not decode
          # We don't want to decode the "require 'whiteout'",
          # so start decoding not before we passed it.
          decode = true if line =~ /require 'whiteout'/
        else
          text << decode( line )
        end
      end
    end
    # Run the code!
    eval text
  end

  # ...

This method again reads the passed file. It skips every line up to and including the "require 'whiteout'" line (a shebang, if present, among them), then copies the rest of the file into the variable text, after passing it through decode() line-by-line. The final line of the method calls eval() on text, which should now contain the restored program.
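The same "skip the header, keep the rest" step can be done on a string in one call. This is a sketch only: hidden_part is a hypothetical name, and accepting either quote style around whiteout is an assumption made so the helper copes with both spellings of the require line.

```ruby
# Hypothetical helper: returns the text that follows the require line.
def hidden_part(file_text)
  _header, marker, payload =
    file_text.partition(/^require ['"]whiteout['"]\r?\n/)
  raise ArgumentError, "no require line found" if marker.empty?
  payload
end

whitened = "#!/usr/bin/ruby\nrequire 'whiteout'\n \t\t \n"
payload = hidden_part(whitened)
puts payload.inspect
```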

On to encode() and decode():

ruby
  #
  # Encodes text to "whitecode". It works like this:
  # - Chars in @@chars_to_ignore are ignored
  # - Each byte is converted to its bit representation,
  #   so that we have something like 01100001
  # - Then, it is converted to whitespace according to @@bit_to_code
  #   - 0 results in a " " (space)
  #   - 1 results in a "\t" (tab)
  #
  def Whiteout.encode( text )
    white = ''
    text.scan(/./m) do |char|
      if @@chars_to_ignore.include?( char )
        white << char
      else
        char.unpack('B8').first.scan(/./) do |bit|
          code = @@bit_to_code[bit]
          white << code
        end
      end
    end
    return white
  end

  #
  # Does the inverse of #encode, it takes "white"
  # and returns the decoded text.
  #
  def Whiteout.decode( white )
    text = ''
    char = ''
    white.scan(/./m) do |code|
      if @@chars_to_ignore.include?( code )
        text << code
      else
        char << @@code_to_bit[code]
        if char.length == 8
          text << [char].pack("B8")
          char = ''
        end
      end
    end
    return text
  end

end

# ...

The comments in there detail the exact process we're looking at here, so I'm not going to repeat them.

Note that @@chars_to_ignore contains "\n" and "\r" so they are not translated. The effect of that is that line-endings are untouched by this conversion. Some solutions used such characters in their encoding algorithm. The gotcha there is that any line-ending translation done to the modified source (say FTP through ASCII mode) will break the hidden code. Robin's solution doesn't have that problem.
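To see both the round trip and the line-ending property in action, here is a compact variant of the same scheme. It is a sketch, not Robin's code: the names KEPT, to_white and from_white are made up, it relies on String#unpack1 (Ruby 2.4 or later), and it assumes the source is UTF-8.

```ruby
KEPT = ["\n", "\r"].freeze # left alone, like @@chars_to_ignore

# Every byte becomes eight spaces/tabs; line endings pass through.
def to_white(text)
  text.each_char.map do |char|
    KEPT.include?(char) ? char : char.unpack1("B*").tr("01", " \t")
  end.join
end

# Every run of eight spaces/tabs becomes one byte again.
def from_white(white)
  bytes = white.b.gsub(/[ \t]{8}/) do |bits|
    [bits.tr(" \t", "01")].pack("B8")
  end
  bytes.force_encoding(Encoding::UTF_8) # assumes UTF-8 source
end

source = "puts 'hi'\nputs 1 + 1\n"
white  = to_white(source)
dos    = white.gsub("\n", "\r\n") # simulate an ASCII-mode transfer

puts from_white(white) == source
puts from_white(dos) == source.gsub("\n", "\r\n")
```

Because the encoded runs never contain "\n" or "\r", rewriting the line endings changes nothing but the line endings of the decoded program.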

Here's the code that ties all those methods into a solution:

ruby
# ...

#
# And here's the logic part of whiteout.
# If it was run directly, whites out the files in ARGV.
# And if it was required, decodes the whitecode and runs it.
#
if __FILE__ == $0
  ARGV.each do |filename|
    Whiteout.whiten( filename )
  end
else
  Whiteout.run( $0 )
end

Again, the comment saves me some explaining.
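For anyone new to the idiom: __FILE__ is the path of the file Ruby is currently reading, while $0 is the path of the program Ruby was asked to run. The two match only when the file itself is that program. The comparison can be sketched with a hypothetical helper, role_of, which exists purely for illustration:

```ruby
# Hypothetical helper: decides which role a file is playing, given the
# values that __FILE__ and $0 would hold inside that file.
def role_of(current_file, program_name)
  current_file == program_name ? :executed : :required
end

# ruby whiteout.rb some_script.rb
direct = role_of("whiteout.rb", "whiteout.rb")

# ruby some_script.rb, which then requires whiteout
loaded = role_of("whiteout.rb", "some_script.rb")

puts direct
puts loaded
```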

That was Robin's first solution to a Ruby Quiz, but I never would have known that from looking at the code. Thanks for sharing, Robin!

Obviously, a conversion of this type grossly inflates the size of the source. Every byte other than a line ending becomes eight whitespace characters, so the file grows to roughly eight times its original size. A couple of solutions used zlib to control the expansion, which I thought was clever. By compressing the source before encoding it (and using a base-three conversion), Dominik Bathon got the inflation down to around three times instead.
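The arithmetic behind that improvement: a byte written as bits needs eight whitespace characters, while a byte written in base three needs only about 5.05 digits (the base-three logarithm of 256), and compression shrinks the byte count before any digits are written. The sketch below shows the general idea only. It is not Dominik's code: the names squeeze and unsqueeze, the choice of space, tab and newline as the three digits, and the trick of treating the compressed bytes as one big integer are all assumptions.

```ruby
require "zlib"

TRITS = " \t\n" # assumed digit characters for 0, 1 and 2

def squeeze(source)
  packed = Zlib::Deflate.deflate(source)
  # A leading 0x01 byte keeps any leading zero bytes from being lost
  # when the bytes are read as one big integer.
  number = ("\x01".b + packed).unpack1("H*").to_i(16)
  number.to_s(3).tr("012", TRITS)
end

def unsqueeze(white)
  hex = white.tr(TRITS, "012").to_i(3).to_s(16)
  hex = "0#{hex}" if hex.length.odd?
  packed = [hex].pack("H*")[1..-1] # drop the 0x01 guard byte
  Zlib::Inflate.inflate(packed).force_encoding(Encoding::UTF_8)
end

source = "puts 'Hello, world!'\n" * 20
white  = squeeze(source)

puts "bit encoding: #{8 * source.length} characters"
puts "zlib + base three: #{white.length} characters"
puts unsqueeze(white) == source
```

The sample source is highly repetitive, so it compresses far better than typical code would. Using a newline as a digit also brings back the line-ending gotcha mentioned earlier.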

Ara T. Howard took a different approach, using whiteout.rb as a database to store the trimmed files. That was a very interesting process, demonstrated well in the submission email. The advantages of this approach are that there is no inflation penalty and the code stays readable (just not in the original location). The disadvantage I see is that it requires the exact same library file to be present both at encoding and decoding, which probably makes sharing the altered code impractical.

As always, my thanks to all who gave this little diversion an attempt. I'm sure we'll see tons of whitespace-only code on RubyForge in the future, thanks to our efforts.

Tomorrow begins part one of our first two-part Ruby Quiz. Stay tuned...