Longest Repeated Substring (#153)

This week's Ruby Quiz is to write a script that finds the longest repeated substring in a given text.

Your program will be passed some text on STDIN and is expected to print the longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is repeated with the same length, you may print any of them. If there is no repeated substring, the result is an empty string (print nothing).

Example:

$ echo banana | ruby longest_repeated_substring.rb
an

OR

$ echo banana | ruby longest_repeated_substring.rb
na

Make sure your code runs efficiently when passed a large text.


Quiz Summary

I first saw this problem years ago. It was one of the Perl's Quiz of the Week problems. I remember being surprised at how the solution was sort of counter-intuitive to me. I mean that my first thought was: start at half the string size, try to find a string of that length that occurs twice in the full text, reduce the length by one character and recheck until you find a match. Yoan Blanc coded that strategy up for us:

ruby
text = STDIN.read

(text.length/2).downto 1 do |l|
match = Regexp.new("(.{#{l}})\\1").match(text)
if match
puts text[match.offset(1)[0]..(match.offset(1)[1]-1)]
break
end
end

Technically the regular expression engine will choke on this solution for a significantly large text. The reason is that quantifiers in {…} limits are only allowed to be so large. The theory is as I described though.

However, even if it could match any size, this solution turns out to be far too slow to be practical. The reason is that given a megabyte or more of text, the longest repeated substring is typically less than 200 bytes. That means you do a whole lot of wasted checks before you are even in the range that a match can occur.

Yoan realized this and submitted a second solution that flipped the search around. Starting with a match of zero characters and finding the next biggest puts you in the right neighborhood from the get go.

The other key realized in Yoan's second solution is that if we know there is a four character repeated substring at some index, we can also count on the fact that there are one, two, and three character repeated substrings at the same index. Given that, you can keep checking at the same index until you don't find a match for a given size. The size found just before that was the biggest at that index.

The Perl guys found further means to optimize this approach, by trimming the search space. However, this too breaks down in the face of significantly large inputs. That's why we have suffix trees.

The idea of a suffix tree is to build a list of every suffix that occurs in the text. Thus the quiz example "banana" breaks down to:

banana
anana
nana
ana
na
a

That may not look like a lot of help yet, but watch what happens if we sort the entries:

a
ana
anana
banana
na
nana

See how the common prefixes group together? We can now compare adjacent entries of our suffix tree for prefixes they have in common. In this case, the second and third element share an "an" and that's one of the possible answers. Note that the second and third share "ana" as well, but selecting that one causes overlap:

[ana]na
[ana]

There's a catch though. Consider this example input from Eric I.: ananana. The suffix tree is:

ananana
nanana
anana
nana
ana
na
a

That sorts into:

a
ana
anana
ananana
na
nana
nanana

In this case comparing just the second and third entries gives us "an", because "ana" would overlap. But, if you compare the second and fourth entries, "ana" becomes a valid answer. That tells us that it's sometimes necessary to look ahead more than just one step.

Enough explaining, let's read some code. I'm going to show Eric's solution below, because it was very well commented and that helped me understand it. However, I'm going to remove those comment in the following listings in the interests of space. I also made a couple of tiny edits that I felt simplified the code a touch.

First we have a simple helper method:

ruby
def longest_common_prefix(s1, s2, max = nil)
min = [s1.size, s2.size, max].compact.min
min.times do |i|
return s1.slice(0, i) if s1[i] != s2[i]
end
return s1.slice(0, min)
end

# ...

Given two Strings, this code will find the longest prefix they share. It begins by finding the smallest length among the passed Strings and an optional provided maximum. It then walks those indexes looking for mismatched characters.

When it finds a mismatch, it returns what came before. How it does that is a little clever though, since it seems to use the same numbers. Consider that we are comparing "abc" and "abd" on the third character. That means i = 2 and s1[i] != s2[i]. That will cause the code to return s1.slice(0, i), but remember that i is used as a length there, not an index. That's why we get the first two characters back ("ab").

If no mismatches arise during the search, they match all the way to our limit and that is returned.

The next method is where all the action is and it's a doozy, so we will take it in pieces. The first chunk builds and sorts the suffix tree:

ruby
# ...

def longest_repeated_substring(string)
size = string.length

suffixes = Array.new(size)
size.times do |i|
suffixes[i] = string.slice(i, size)
end

suffixes.sort!

# ...

This code might seem wildly inefficient (memory-wise) at first glance, but Ruby will surprise you here. When you slice() a substring like this, Ruby cheats and doesn't actually make a copy of the text. Instead, the new object is a pointer into the old object. Now, if you change either String down the road, Ruby must and will finish the job, turning it into a full copy. We call this behavior "Copy on Write." Since we won't be changing these Strings though, just examining them, we're safe with the tiny pointers for the duration.

Note that the size variable is always used as the substring length here, though it's too long in all but the first case. Ruby will stop at the end of the String even if you provide a longer length, so it does the same thing and avoids unneeded math or Range object construction.

OK, let's begin the search through the tree:

ruby
# ...

best = ""
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

(1...size).each do |i|
s1 = suffixes[i]

neighbors_to_check.downto(1) do |neighbor|
s2 = suffixes[i - neighbor]

distance = (s1.size - s2.size).abs
if distance < at_least_size
if s1.size >= at_least_size &&
s2.size >= at_least_size &&
s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = [neighbors_to_check, neighbor + 1].max
else
neighbors_to_check = neighbor
end
next
end

# ...

First, some variables are set to hold the best match so far, the size a match would need to be to beat it, the distance between the substrings, and how far down the tree we need to search. After, that we enter a simple loop trying each suffix.

The neighbors_to_check iteration is our look ahead for similar suffixes that share our prefix. We usually only need to look one step ahead, but these checks watch for overlap cases and extends our search when needed.

Note that we figure a distance between substrings here, even though we didn't remember where they started. That's possible because all suffixes are anchored to the far edge of our original text. Knowing that, comparing lengths is the same as comparing starting indexes.

A distance below our needed match size warns us that there is overlap and we may need to look further ahead in this case. The if conditions just ensure the Strings are of the length we should even consider and that they match at least that much. If all of that turns out to be true, our search ahead is extended.

Now we are ready to do the actual comparisons:

ruby
# ...

unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = neighbor
next
end

best = longest_common_prefix(s1, s2, distance)
at_least_size = best.size + 1
if best.size == distance
neighbors_to_check = [neighbors_to_check, neighbor + 1].max
else
neighbors_to_check = neighbor
end
end
end

best
end

# ...

The unless check we see here is just a short circuit optimization. There's no need to check for common prefixes if the Strings aren't a match at least as long as our current best.

If we've made it this far, we know the current two Strings share a prefix that meets or exceeds our current best. A hand-off is made to longest_common_prefix() to grab our new best and the rest of the iterator resets the search parameters.

The best we found in the entire search is then returned as the solution.

Here's the interface code that wraps this method:

ruby
# ...

if $0 == __FILE__
string = nil
if ARGV[0] == "-f"
string = File.read(ARGV[1])
elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts "usage:"
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts %Q{"#{result}" (#{result.length} characters)}
end

Most of the above code just sorts out the command-line arguments. The checks determine whether to read the text from a file, STDIN, or the arguments themselves.

The last two lines call the workhorse method we just finished examining and print details about the best match found.

My thanks to all who stretched my brain with all of the excellent discussion about this week's problem.

Tomorrow we will try Ruby's hand at coin counting...