Sampling (#39)

This week's Ruby quiz is a classic sampling problem that I've seen many interesting solutions for in the past.

The challenge is to implement a program called "sample" that takes exactly two integer inputs. The first of those should be the number of members chosen at random you would like included in the sample. The second is the upper boundary (exclusive) of the range of integers members can be selected from. The lower boundary is zero (inclusive).

Your program should output exactly the requested number of members from the defined range to STDOUT, one number per line. Each member must be unique and they should appear in sorted order.

Here are some sample runs of my solution to illustrate the task:

$ ./sample
Usage: sample MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
$ ./sample 3 10
0
1
5
$ ./sample 3 10
1
2
8
$ ./sample 3 10
2
3
9
$ ./sample 9 10
0
2
3
4
5
6
7
8
9
$ ./sample 20 400
1
3
4
25
32
36
42
56
93
111
137
139
153
213
222
226
263
289
293
314

Now for all the algorithm junkies out there that enjoy a little friendly competition, here's the time to beat:

$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt

real 30m10.593s
user 29m54.544s
sys 0m4.736s
$ ls -l big_sample.txt
-rw-r--r-- 1 james staff 49011436 Jul 11 15:26 big_sample.txt
$ head big_sample.txt
112
221
450
655
900
1216
1399
1469
1494
1628
$ tail big_sample.txt
999995325
999996330
999996552
999996682
999997426
999997676
999998253
999999153
999999658
999999693


Quiz Summary

To me, the solutions to this week's quiz are a fascinating combination of the various degrees of correct, dangerous, and fast. Let's examine a few different solutions and I'll try to show you why I think that.

The fast characteristic seemed to be in abundant supply. Ezra Zygmuntowicz started scaring people with time readouts before solutions were even posted. Let's start with that code:

ruby
#!/usr/local/bin/ruby
members, limit, index = ARGV[0].to_i, ARGV[1].to_i, 0
member_range = limit / members
0.upto(members-1) do
res = rand member_range
puts res + index
index += member_range
end

How does this work?

We can see that it reads in the two settings and initializes an index to zero. Next if calculates a member_range by dividing limit by members. That line is when this code started to smell fishy to me.

The upto() iterator does all the work. It generates a random number in the member_range, adds that to the index and spits it out, then bumps index by member_range. This happens members times.

I think it's important to think about how this works. Let's assume we want two members with a limit of ten. That's going to give us a member_range of five. So we'll chose a random number between zero and four, add zero to it (index), and spit that out. Then bump index by five and repeat. Put another way, we'll choose a random member of the first half of the range and then a random member from the second half of the range. We may get one and nine or even four and five, but we'll never see seven and eight since they're both in the same half of the range.

That's a lightning quick solution. It solves the quiz example in 26 seconds on my box. However, the real question is, is it correct? Obviously, that's subjective. My opinion is that I can find combinations of the range that it will not choose and I don't consider that to be meeting the requirement of randomly selecting members from the (amended) quiz. Many of us are probably familiar with the famous quote:

"Anyone who considers arithmetical methods of producing random digits is,
of course, in a state of sin." --Von Neumann

I think that's probably more applicable to the above code than it is to random number generation in general.

Let's look at another solution. This one is from Dominik Bathon:

ruby
#!/usr/local/bin/ruby

if $DEBUG
def ptime(evt)
$ptimeg ||= Time.now.to_f
STDERR.puts "#{evt} at #{Time.now.to_f - $ptimeg}"
end
else
def ptime(evt)
# noop
end
end

# the actuall sampling, just store the seen values in a hash and return the
# sorted hash keys
def sample(cnt, lim)
x = {}
tmp = nil

for i in 0...cnt
while x.has_key?(tmp = rand(lim))
end
x[tmp] = true
end
ptime "rand"

x = x.keys.sort
ptime "sort"
x
end

# this is the key to success, but needs lots of ram
GC.disable

ptime "start"

x = sample(cnt=ARGV[0].to_i, ARGV[1].to_i)

# creating the newline string only once saves 5s
nl = "\n"
i = 0
while i+10 <= cnt
# this is saves about 1s
print x[i], nl, x[i+1], nl, x[i+2], nl, x[i+3], nl, x[i+4],
nl, x[i+5], nl, x[i+6], nl, x[i+7], nl, x[i+8], nl, x[i+9], nl
i += 10
end
for j in i...cnt
print x[j], nl
end
ptime "print"

I don't want to focus too much on the details here. The first chunk of code just defines a timing helper method and the last chunk it some heavily optimized printing. The actual solution is the sample() method in the middle. The comment before the method tells you exactly how it works, so I'm not going to insult your intelligence by repeating it. This is a great time to bring up an interesting question that came up on Ruby Talk though:

"How would someone not being aware of the advantages a Hash-lookup gives
(compared to Array-lookup) choose to implement the problem with a Hash?
It is not obvious for inexperienced programmers." --Stefan Mahlitz

I think the keyword in what we're trying to find here is "unique". Whenever I think unique, I start thinking about a Hash. The keys of a Hash are unique, by definition. In order to find out if something is unique in an Array, you've got to walk it. In order to find out if something if something exists in a Hash, you just need to check for that one key. That's slower than a single Array lookup, but much faster than five million Array lookups. Get into the habit of making that quick mental translation: Unique (almost always) means Hash.

Back to Dominik's code.

Again, this code is fast. It solves the quiz example in 40 seconds for me. My main concern is what Dominik reported in the submission message:

"But the "real solution" is GC.disable. That has a downside of course. The
above run of sample.rb needs approx. 400MB of RAM. So, don't try this at
home if you have less than 512MB ;-)" --Dominik Bathon

I'm probably a little old school, but that warning scares me. What if we add a zero to both numbers in the quiz example? If I looked into some code and saw that it was using a lot of memory, I don't think I would try untying the memory safety net with GC.disable(). Clearly this solution works and is very fast for certain samples on certain hardware, but I think it needs some caution tape warning users and apparently Dominik does too.

If you like the Hash idea, here's the cleanest version I saw from Jim Menard with a slight syntax change suggested by Daniel Brockman:

ruby
#!/usr/bin/env ruby

require 'set'

def sample(members, range)
selected = Set.new
members.times {
begin
val = rand(range)
end until selected.add?(val)
}
selected.to_a.sort
end

puts sample(ARGV[0].to_i, ARGV[1].to_i).join("\n")

This solution uses the Ruby set library. (Sets are implemented using Hashes, because of the aforementioned unique behavior.) This version just adds a random choice to the selected set, member times. The add?() method ensures that it was a new member for the set, causing the code to loop until it returns true. The results are then converted to an Array, sorted, and printed.

This solution took two minutes and ten seconds with the quiz example on my box. It still allocates a good sized chunk of memory, but significantly less than Dominik's code (around 150 MB).

The other issue with these kinds of solutions is collisions. In the quiz example the members requested are much less than the limit. However, as those two numbers get closer together, random selections will be repeats a lot more often. That can reverse the nice runtimes of these solutions. Here's a simple script to demonstrate collisions:

ruby
#!/usr/local/bin/ruby -w

members, limit = ARGV.map { |n| Integer(n) }

choices = Hash.new
collisions = 0

until choices.size == members
choice = rand(limit)
if choices.include? choice
collisions += 1
else
choices[rand(limit)] = true
end
end

warn "#{collisions} collisions"

Now watch a few runs of that script

$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 2 10
1 collisions
$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 9 10
25 collisions
$ ruby collisions.rb 9 10
21 collisions
$ ruby collisions.rb 9 10
50 collisions
$ ruby collisions.rb 9_999 10_000
39205236 collisions
$ time ruby collisions.rb 9_999 10_000
35249691 collisions

real 1m13.996s
user 1m13.876s
sys 0m0.084s

As you can see the number of collisions can climb very fast and the runtime is quickly dropping off. Joost Diepenmaat suggested that it might be a good idea to switch algorithms when members > limit / 2. Joost's idea was simply to reverse the above algorithm, looking for what not to include, but here's a different option from Matthew D Moss:

ruby
#!/usr/local/bin/ruby

(k, n) = ARGV.map { |s| s.to_i }
n.times do |i|
r = rand * (n - i)
unless (k < r)
puts i
k -= 1
end
end

This solution does not use memory to store the numbers or even need to call sort() at any point. It works by walking the entire range and randomly selecting which numbers to include. You can see that it selects a random number in what's left of the range on each iteration. When the number of members needed is less than that choice, the code just moves on to the next number. However, when members is equal to that choice, the number is selected (printed) and the remaining members count is dropped. The behavior ensure that we get enough numbers, with each number having an equal chance at selection.

The minus with this one is speed. Walking all those numbers takes time. This algorithm is very similar to my own, used to make the quiz. You saw how high that runtime got. However, the code's not gobbling up memory and it is going to find a proper sample, eventually.

That concludes our tour of the solutions. As you can see, there are many trade-offs to be made over this problem. If you want super speed and can spare the memory, a Hash based solution is a good answer. Otherwise, something like Matthew's solution is probably the best choice. It's fast enough on smaller inputs, it doesn't keep eating memory, and it gives a good random sample.

Many, many thanks to all who decided to jump in on this problem. The solutions and discussion were all vary insightful.

Tomorrow we will find out if heaps grow on trees...