Mailing List Files (#115)

The Ruby Talk mailing list archives will show files attached to incoming messages. However, it's not always easy to get at the data from these files using the archives alone. The attachments are sometimes displayed in not-too-readable formats:

An Example

Another Example

This is tough for those of us who like to play with Ruby Quiz solutions.

This week's quiz is to write a program that takes a message id number as a command-line argument and "downloads" any attachments from that message. Assume message ids are for Ruby Talk posts by default, but you may want to provide an option to override that so we can support lists like Ruby Core as well.

If no path is given, write the attachments to the working directory. When there is a path, your code should place the files there instead.
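One way to read that interface is sketched below with the standard optparse library. The `--list` option name, the `'ruby-talk'` default string, and the `parse_args` helper are all assumptions made for illustration; the quiz itself only asks for a message id, an optional path, and some way to switch lists.

```ruby
require 'optparse'

# Hypothetical helper: turns ARGV-style input into an options hash.
def parse_args(argv)
  options = { list: 'ruby-talk' } # assumed name for the default list

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: fetch_attachments [options] MESSAGE_ID [PATH]'
    opts.on('-l', '--list NAME', 'Mailing list to read from') do |name|
      options[:list] = name
    end
  end

  # parse returns whatever is left once the options have been consumed.
  message_id, path = parser.parse(argv)
  raise ArgumentError, parser.banner unless message_id

  # With no path, fall back to the working directory.
  options.merge(message_id: message_id, path: path || Dir.pwd)
end
```

Calling `parse_args(%w[--list ruby-core 99 /tmp/out])` would select the other list and write to `/tmp/out`, while `parse_args(%w[12345])` keeps both defaults.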


Quiz Summary

I've been playing a little with TMail lately, which is what really inspired this quiz. I thought that a simple solution to this problem would be to pull the pages down with open-uri and then dump them into TMail and just pull the attachments from that. It turns out to be a bit harder to do that than I expected, but one solution did follow that path.

What I love about this plan is the fact that you are just stitching the real tools together. I like leaning on libraries to get tons of functionality with just a few lines of code. Apparently, so does Louis J Scoras! Check out this list of dependencies that kick-starts his solution (I've removed the excellent comments in the code to save space):

ruby
#!/usr/bin/env ruby

require 'action_mailer'
require 'cgi'
require 'delegate'
require 'elif'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'tempfile'

# ...

Wow.

Let's start with the standard libraries. Louis pulls in cgi to handle HTML escapes, delegate to wrap existing classes, fileutils for easy directory creation, open-uri to fetch web pages, and tempfile for creating temporary files, of course. That's an impressive set of tools, all of which ship with Ruby.

The other three dependencies are external. You can get them all as gems. action_mailer is a component of the Rails framework used to handle email. Louis doesn't actually use the action_mailer part, just the bundled TMail dependency. This is a trick for getting TMail as a gem.

elif is a little library I wrote as a solution to an earlier quiz (#64). It reads files line by line, but in reverse order. In other words, you get the last line first, then the next to last line, all the way up to the first line.
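To make the reverse ordering concrete, here is a minimal sketch of the same idea using only the standard library. It is not how Elif works internally: `each_line_reversed` is a hypothetical helper that simply loads the whole file, which is fine for something as small as a mail message.

```ruby
require 'tempfile'

# Hypothetical helper, not part of Elif: yields a file's lines last-to-first.
# It reads the entire file into memory, so it only suits small files.
def each_line_reversed(path)
  return enum_for(:each_line_reversed, path) unless block_given?

  File.readlines(path).reverse_each { |line| yield line }
end

Tempfile.create('reverse_demo') do |file|
  file.write("first\nsecond\nthird\n")
  file.flush
  # Prints "third", then "second", then "first".
  each_line_reversed(file.path) { |line| puts line }
end
```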

hpricot is a fun little HTML parser from Why the Lucky Stiff. It has a unique interface that makes it popular for web scraping.

Now that Louis has imported all the tools he could find, he's ready to do some fetching. Here's the start of that code:

ruby
module Quiz115
  class QuizMail < DelegateClass(TMail::Mail)
    class << self
      attr_reader :archive_base_url

      def archive_base_url
        @archive_base_url ||
          "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
      end

      def solutions(quiz_number)
        doc = Hpricot(
          open("http://www.rubyquiz.com/quiz#{quiz_number}.html")
        )
        (doc/'#links'/'li/a').collect do |link|
          [CGI.unescapeHTML(link.inner_text), link['href']]
        end
      end
    end

# ...

This object we are examining now is a TMail enhancement, via delegation. This section adds some class methods for convenience. I believe the attr_reader line is actually intended to be attr_writer though, giving you a way to override the base URL. The reader is defined manually and just defaults to the Ruby Talk mailing list.
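Both ideas in that paragraph can be tried without TMail. The sketch below is my own illustration, not code from the solution: `Archive` shows the attr_writer-plus-manual-reader pattern I think was intended, and `LabeledList` shows what DelegateClass buys you. Both class names and the example.com URL are hypothetical.

```ruby
require 'delegate'

# Hypothetical class showing an overridable class-level setting with a default.
class Archive
  DEFAULT_URL = "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"

  class << self
    attr_writer :base_url # lets callers override the default

    def base_url
      @base_url || DEFAULT_URL
    end
  end
end

# Hypothetical wrapper: any method it does not define is forwarded to the Array.
class LabeledList < DelegateClass(Array)
  def initialize(label, items)
    @label = label
    super(items)
  end

  def describe
    "#{@label}: #{size} item(s)" # size comes from the wrapped Array
  end
end

puts Archive.base_url                                # the Ruby Talk default
Archive.base_url = "http://example.com/ruby-core/"   # made-up override URL
puts Archive.base_url
puts LabeledList.new('files', %w[a.rb b.rb]).describe
```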

The solutions() method is a neat added feature of the code which allows you to pass in a Ruby Quiz number in order to fetch all the solution emails for that quiz. Here you can see some Hpricot parsing. Its XPath-in-Ruby style syntax is used to pull the solution links off of the quiz page at rubyquiz.com.

Let's get to the real meat of this class now:

ruby
# ...

    def initialize(mail)
      temp_path = to_temp_file(mail)
      boundary = MIME::BoundaryFinder.new(temp_path).find_boundary

      @tmail = TMail::Mail.load(temp_path)
      @tmail.set_content_type 'multipart', 'mixed',
        'boundary' => boundary if boundary

      super(@tmail)
    end

    private

    def to_temp_file(mail)
      temp = Tempfile.new('qmail')

      temp.write(if (Integer(mail) rescue nil)
        url = self.class.archive_base_url + mail
        open(url) { |f| x = cleanse_html f.read }
      else
        web = URI.parse(mail).scheme == 'http'
        open(mail) { |m| web ? cleanse_html(m.read) : m.read }
      end)

      temp.close
      temp.path
    end

    def cleanse_html(str)
      CGI.unescapeHTML(
        str.gsub(/\A.*?<div id="header">/mi,'').gsub(/<[^>]*>/m, '')
      )
    end
  end

# ...

In initialize() the passed mail reference is fetched into a temporary file and a special boundary search is performed, which we will examine in detail in just a moment. The temp file is then handed off to TMail. After that a content_type header is synthesized, as long as we found a boundary.

The actual fetch is made in to_temp_file(). The code that fills the Tempfile is a little tricky there, but all it really does is recognize when we are loading via the web so it can cleanse_html(). That method just strips the tags around the message and unescapes entities.
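Here is a small offline sketch of those two decisions. The `cleanse_html` body is copied from the solution; `reference_kind` is a hypothetical helper I added to name the three cases the tricky expression distinguishes, and `SAMPLE_PAGE` is a made-up stand-in for an archive page.

```ruby
require 'cgi'
require 'uri'

# Copied from the solution: drop everything up to the header div,
# strip the remaining tags, then unescape HTML entities.
def cleanse_html(str)
  CGI.unescapeHTML(
    str.gsub(/\A.*?<div id="header">/mi, '').gsub(/<[^>]*>/m, '')
  )
end

# Hypothetical helper mirroring the checks made in to_temp_file().
def reference_kind(mail)
  return :message_id if (Integer(mail) rescue nil)

  URI.parse(mail).scheme == 'http' ? :web_page : :local_file
end

# Made-up page content, not a real archive response.
SAMPLE_PAGE = "<html>\n<div id=\"header\">\n" \
              "From: a &lt;a@example.com&gt;\n" \
              "<b>Subject</b>: Hi &amp; bye\n</div></html>"

puts cleanse_html(SAMPLE_PAGE)
p reference_kind('12345')
p reference_kind('http://example.com/message.html')
p reference_kind('saved_message.txt')
```

Only the first two kinds need cleansing; a local file is assumed to hold the raw message already.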

Now we need to dig into that boundary problem I sidestepped earlier. The messages on the web archives are missing their Content-type header and we need to restore it in order to get TMail to accept the message. With messages that contain attachments, that header should be multipart/mixed. However, the header also points to a special boundary string that divides the parts of the message. We have to find that string so we can set it in the header.
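If you haven't looked inside a multipart message before, this sketch may help. The boundary `XYZ`, the sample body, and the `split_parts` helper are all invented for illustration. In a complete message the missing header would read `Content-Type: multipart/mixed; boundary="XYZ"`, each part would begin after a `--XYZ` line, and the message would end with `--XYZ--`.

```ruby
BOUNDARY = 'XYZ' # made-up boundary string

# Made-up message body with one text part and one attachment part.
MESSAGE_BODY = <<~BODY
  --XYZ
  Content-Type: text/plain

  Here is my solution.
  --XYZ
  Content-Type: text/plain
  Content-Disposition: attachment; filename="solution.rb"

  puts "Hello"
  --XYZ--
BODY

# Hypothetical helper: cut the body at each delimiter line and drop empties.
def split_parts(body, boundary)
  delimiter = /^--#{Regexp.escape(boundary)}(?:--)?[ \t]*\n?/
  body.split(delimiter).reject { |part| part.strip.empty? }
end

parts = split_parts(MESSAGE_BODY, BOUNDARY)
puts "Found #{parts.size} parts"
puts parts.last
```

Without the boundary string there is no reliable way to tell where one part stops and the next begins, which is why it has to be recovered before the header can be rebuilt.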

The next class handles that operation:

ruby
# ...

  module MIME
    class BoundaryFinder
      def initialize(file)
        @elif = ::Elif.new(file)
        @in_attachment_headers = false
      end

      def find_boundary
        while line = @elif.gets
          if @in_attachment_headers
            if boundary = look_for_mime_boundary(line)
              return boundary
            end
          else
            look_for_attachment(line)
          end
        end
        nil
      end

      private

      def look_for_attachment line
        if line =~ /^content-disposition\s*:\s*attachment/i
          puts "Found an attachment" if $DEBUG
          @in_attachment_headers = true
        end
      end

      def look_for_mime_boundary line
        unless line =~ /^\S+\s*:\s*/ || # Not a mail header
               line =~ /^\s+/ # Continuation line?
          puts "I think I found it...#{line}" if $DEBUG
          line.strip.gsub(/^--/, '')
        else
          nil
        end
      end
    end
  end
end

# ...

This class is a trivial parser that hunts for the missing boundary. It uses Elif to read the file backwards, watching for an attachment to come up. When it detects that it is inside an attachment, it switches modes. In the new mode it skips over headers and continuation lines until it reaches the first line that doesn't seem to be part of the headers. That's the boundary.
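The same two-mode scan can be tried on a string, without Elif or a file. This is a condensed sketch using the regular expressions from the class above; the `find_boundary` function form and the `SAMPLE_MAIL` message are my own inventions for the demonstration.

```ruby
# Made-up message, missing its top-level Content-Type header.
SAMPLE_MAIL = <<~MAIL
  From: solver@example.com
  Subject: [SOLUTION] Mailing List Files (#115)

  --XYZ
  Content-Type: text/plain

  Here is my solution.
  --XYZ
  Content-Type: text/plain
  Content-Disposition: attachment;
   filename="solution.rb"

  puts "Hello"
  --XYZ--
MAIL

# Hypothetical condensed version of BoundaryFinder#find_boundary.
def find_boundary(text)
  in_attachment_headers = false

  # Walk the lines from last to first, as Elif would.
  text.lines.reverse_each do |line|
    if in_attachment_headers
      # Skip headers and continuation lines above the attachment marker.
      next if line =~ /^\S+\s*:\s*/ || line =~ /^\s+/

      return line.strip.sub(/^--/, '')
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true
    end
  end

  nil # no attachment seen, so no boundary to report
end

p find_boundary(SAMPLE_MAIL)
```

Reading backwards is what makes this work: the first non-header line found above an attachment's headers is the delimiter that opened that part.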

The rest of the code just puts these tools to work:

ruby
# ...

include Quiz115
include FileUtils

def process_mail(mailh, outdir)
  begin
    t = QuizMail.new(mailh)
    if t.has_attachments?
      t.attachments.each do |attachment|
        outpath = File.join(outdir, attachment.original_filename)
        puts "\tWriting: #{outpath}"
        File.open(outpath, 'w') do |out|
          out.puts attachment.read
        end
      end
    else
      outfile = File.join(outdir, 'solution.txt')
      File.open(outfile, 'w') {|f| f.write t.body}
    end
  rescue => e
    puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
  end
end

def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

# ...

process_mail() builds a QuizMail object out of the passed reference number, then copies the attachments from TMail to files in the indicated directory. If the message has no attachments, you just get the full message instead.
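The file-writing half of that method can be exercised without TMail. In this sketch, `Attachment` is a hypothetical Struct standing in for TMail's attachment objects and `write_attachments` is my own helper. It differs from the solution in two small ways, both my choice: it writes in binary mode with write instead of puts, so no trailing newline is added, and it runs the filename through File.basename as a guard against path components.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for a TMail attachment.
Attachment = Struct.new(:original_filename, :data)

# Hypothetical helper: writes each attachment into outdir, returns the paths.
def write_attachments(attachments, outdir)
  FileUtils.mkdir_p(outdir)

  attachments.map do |attachment|
    outpath = File.join(outdir, File.basename(attachment.original_filename))
    File.open(outpath, 'wb') { |out| out.write(attachment.data) }
    outpath
  end
end

Dir.mktmpdir do |dir|
  files = [Attachment.new('solution.rb', "puts 'Hello'\n")]
  write_attachments(files, File.join(dir, 'out')).each do |path|
    puts "Wrote #{File.basename(path)} (#{File.size(path)} bytes)"
  end
end
```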

to_dirname() is a directory name sanitizer for when the code is downloading the solutions from a quiz, as mentioned earlier.
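To see what that sanitizing does, here is the method from the solution applied to a few names. The sample names are just inputs I picked for the demonstration.

```ruby
# Copied from the solution: lowercase, drop shell-unfriendly characters,
# and collapse runs of whitespace into underscores.
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

['Louis J Scoras', 'Why? (the Stiff)', 'Some   Solver!'].each do |name|
  puts "#{name.inspect} -> #{to_dirname(name)}"
end
```

Note that characters outside the deleted set, such as slashes or dots, pass through untouched.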

Here's the application code:

ruby
# ...

query = ARGV[0]
outdir = ARGV[1] || '.'

unless query
  $stderr.puts "You must specify either a ruby-talk message id, or a
quiz number (prefixed by 'q')"

  exit 1
end

if query =~ /\Aq/i
  quiz_number = query.sub(/\Aq/i, '')
  puts "Fetching all solutions for quiz \##{quiz_number}"

  QuizMail.solutions(quiz_number).each do |solver, url|
    puts "Fetching solution from #{solver}."

    dirname = to_dirname(solver)
    solver_dir = File.join(outdir, dirname)

    mkdir_p solver_dir
    process_mail(url, solver_dir)
  end
else
  process_mail(query, outdir)
end

exit 0

This code just pulls in the arguments, and runs them through one of two processes. If the number is prefixed with a q, the code scrapes rubyquiz.com for that quiz number and pulls all the solutions. It creates a directory for each solution, then processes each of those messages. Otherwise, it handles just the individual message.
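The branch on the q prefix is easy to isolate. The sketch below reuses the two regular expressions from the application code; `classify_query` itself is a hypothetical helper that returns a tag instead of doing any fetching.

```ruby
# Hypothetical helper: decide which of the two processes a query selects.
def classify_query(query)
  if query =~ /\Aq/i
    [:quiz, query.sub(/\Aq/i, '')] # strip the prefix, keep the quiz number
  else
    [:message, query]
  end
end

p classify_query('q115')
p classify_query('Q64')
p classify_query('12345')
```

Because the match is case-insensitive, `q115` and `Q115` are treated the same way.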

My thanks to those who helped me solve this problem for all quiz fans. We now have an excellent resource to share with people who ask about retrieving the garbled solutions.

Tomorrow, it's back to fun and games for the quiz, but this time we're on a search for pure strategy...