Ruby Quiz - IP to Country (#139)

IP to Country (#139)

This week's Ruby Quiz is to write a simple utility. Your program should accept an IP address as a command-line argument and print out the two letter code for the country that IP is assigned in. You can find a database for the matching at:

IP to Country Database

To keep the problem interesting though, let's write our programs with a focus on speed and memory efficiency.

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

Quiz Summary

Matthias Wätcher made a classic mistake this week. He sent in his first ever quiz solution and told us not to expect anything special out of it. We later learned that it was a lot faster than all of the other solutions, thanks to more effort from Matthias. His code taught me some new tricks and I want to share those with you.

But first, let's talk speed.

Most of the solutions are quite quick, relatively speaking. What's the shared secret of speed? Binary search. The records are in sorted order, so it's possible to perform a simple cut-the-possible-matches-in-half-at-each-step lookup. That favored algorithm makes short work of what could otherwise be a lengthy search.

Now you can do a binary search on the database file as is and several solutions did. This is a little trickier, because the records are not of a standard size. You have to be super careful not to accidently skip records. While the solutions handled the challenge very well, you can make things quite a bit easier if you are willing to preprocess the file.

You can also speed things up with some clever preprocessing. That was the other trick Matthias used to find answers so quickly.

Matthias realized that while the actual records contain quite a bit of data, we only really care about the start of a range, the end of a range, and the country code. You can fit all of that in just ten bytes with four for each address and two for the country.

The file also contains some consecutive ranges that can be collapsed for the purposes of our search. Such ranges may differ in some attributes, but as long as their country codes are the same we can treat them as a single unit.

Now that you know the goal, let's take a look at the packing code Matthias sent in:

ruby

#!/usr/bin/ruby
# comment

last_start=nil
last_end=nil
last_country=nil
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!("\"").split(",")
s,e = s.to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country = s,e,co
else
if s==last_end+1 and co==last_country
# squeeze if successive ranges have zero gap
last_end=e
else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country = s,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end

The three variables declared up front are for tracking the ranges. These will be used to collapse consecutive ranges.

The next bit of code creates the binary database we will write into and parses the CSV formated data. Though the data is in the CSV format, the actual content is trivial and you don't need a proper parser to get at it. As you can see, Matthias just checks for lines starting with a quote, pulls out all quote characters, and split()s on commas. This is faster than using a parser.

The if/else chain in the middle of the code does most of the work. First, it stores the initial range in the tracking variables. For all other ranges it tests to see if it is consecutive with the last one recorded and has the same country code. When it does, the endpoint of the last range is just bumped to include them both. When it doesn't, the last range is written to the file and the code starts tracking the new range. The final if statement ensures that we write out the final range before exiting.

This really isn't too much work and it runs in under two seconds on my box. That's a small price to pay for a speed gain every time we look up an address.

Let's dive into the search code now. Here's how it begins:

ruby

#!/usr/bin/ruby

# take command line or stdin -- the latter has performance advantage
# for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end

# ...

As the comment tells us, Matthias added another way to feed the program addresses. Where I said we should take one from the command-line arguments, this code actually handles any number of addresses from command-line arguments or STDIN.

This next bit of code opens up our database:

ruby

# ...

# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1

# ...

The seek() and pos() calls here are used to find the end of the file. You need to know both ends of a data set to perform a binary search. Note that dividing by ten gives us the count of records since they are a fixed width and Matthias subtracts one because we will never need to seek() to the end of the last record again.

Now there's a touch more preparation for each address we want to find:

ruby

# ...

arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join

# ...

In order to avoid a bunch of to_i() calls on each range extracted from the database, Matthias just encodes the address we are hunting for into the character encoding used in the database. This way simple String comparisons can determine if the address it in the range.

We're now ready for the actual search code:

ruby

# ...

# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
str=rfile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
end

This is a pretty textbook binary search. You always begin by going to the middle of the current low and high. In this case that's a seek() call and we have to remember to multiply the midpoint by our record size of ten.

Once we are at the right record, the code read()s the first eight bytes and compares the address against the low and high for the range. When it is in the range, the final two bytes are read and printed. If the address is below the current range, we drop the high to below the current range. If it's above, we raise the low to above the current range. A final check is added to catch the cases where no match was found, in which case our low will bypass the high.

Remember that a second goal of the quiz described search was to be memory friendly. This code does terrific on that front since only one record needs to be in memory at a time. Once the checks are done, it can be replaced by the next record.

My thanks to all who showed so many great examples of this classic algorithm.

Tomorrow we will write programs that can help me to better understand my wardrobe...