Reverse the Polarity (#143)

by Benjohn Barnes

Usually, you write a regular expression and then search for it in a text string. How about doing the opposite? Given a regular expression, generate all of the strings that it will match.

Rather amazingly, this has independently come up a couple of times in conversations with friends, so perhaps it's actually useful. I wanted to use it to specify possible domain names for a potential new company...

/(lovely|delicious|splendid)(food|snacks|munchies)/.generate
=> [lovelyfood, deliciousfood, splendidfood,
lovelysnacks, delicioussnacks, splendidsnacks,
lovelymunchies, deliciousmunchies, splendidmunchies]

The full regular expression grammar is pretty complex, so maybe you'll want to just go with a small subset. :-)


Quiz Summary

I was glad to see this problem submitted. I worked on it a ways back when I was translating examples from Higher-Order Perl. It shows up in the book during the discussion of infinite streams.

Watching others solve it was great fun because I realized that Ruby Quiz fans are crazy. A couple of you implemented an almost scary amount of the regular expression syntax.

Probably the craziest solution comes from Jesús who even shoehorned in backreferences. Let's take a look at parts of that lengthy solution below.

First, some limitations:

ruby
# Number of times to repeat for Star and Plus repeaters
TIMES = 2

# Set of chars for Dot and negated [^] char groups
#CHARS = [("a".."z").to_a, ("A".."Z").to_a, ".", ",", ";"].flatten
CHARS = %w{a b c d e}

# ...

Many regular expressions can match an infinite number of Strings. Consider the trivial expression /a+/. It matches "a", "aa", "aaa", etc. There are two ways to handle this practically: limit the repetition or provide an infinite series of matches user code can explore and stop as needed. Jesús did the former.

These variables control how much repetition is done and what character set is used for constructs that match any character or any character not included in a listed subset.

If your are curious to see the infinite streams approach, I wrote about it as part of this blog post:

Infinite Streams

That solution doesn't include the regular expression parsing code though.

Getting back to Jesús's code, the next section defines several classes in the categories of repeaters and groups. Groups are pieces of a matched String that can be generated and repeaters control how many times those groups appear. Let's begin with a trivial repeater that doesn't really repeat:

ruby
# ...

class OneTimeRepeater
def initialize(group)
@group = group
end

def result
@group.result
end
end

# ...

As, you can see, this repeater just wraps some group. The result() of this repeater is the result of that group, one time. Obviously this is used for things that don't repeat.

The next repeater should be plenty familiar to regular expression fans though:

ruby
# ...

class StarRepeater
def initialize(group)
@group = group
end

def result
r = []
group_res = @group.result
group_res.unshift("")
TIMES.times do
r << group_res
end
combine(r).uniq
end
end

# ...

This would be the repeater used to handle expressions like /a*/. The result of this repeater is zero or more occurrences of the group it was passed. Practically speaking here that means between one and TIMES copies of the group or the empty String to represent zero. We see this collected into an Array here and passed on to combine() for generation.

The PlusRepeater, QuestionMarkRepeater and RangeRepeater classes are constructed similarly. In the interests of space and time, I'm not going to show those here.

Now we are ready for the groups. Again we begin with the trivial cases:

ruby
# ...

class SingleChar
def initialize(c)
@c = c
end
def result
[@c]
end
end

# ...

class Dot
def result
CHARS
end
end

# ...

Here we see groups that echo a character and represent a special range of characters as their results. Obviously the first is used for literal chunks of an expression, like /a/, and the second is used for the special regular expression character /./.

Let's examine the group used to represent alternations like /a|b/:

ruby
# ...

class OrGroup
def initialize(first_groupset, second_groupset)
@first = first_groupset
@second = second_groupset
end

def result
strings = @first.map {|x| x.result}
s = combine(strings)
strings = @second.map {|x| x.result}
s.concat(combine(strings))
end
end

# ...

Since the result()s of a group are just the possible generations of that group, alternation can be represented as the left choices plus the right choices. That's what we see here.

Jesús's code includes other groups: CharGroup for character classes, MultiGroup for nesting groups and managing reference counts, and BackReference for supporting the regular expression feature of the same name. I'm going to skip over these too, to keep the size of this summary reasonable.

Now let's have a look at the combine() method I've glossed over twice now:

ruby
# ...

# Combines arrays, concatenating each string
# merging the possible groups they have
# Starts combining the first two arrays, then goes on
# combining each other array to the result of the
# previous combination
def combine(arrays)
string = arrays.inject do |r, rep|
temp = []
r.each {|aa| rep.each {|bb| temp << (aa.concat_and_merge_groups(bb))}}
temp
end
string
end

class String
attr_accessor :filled_groups

def add_filled_group(num, group)
@filled_groups ||= {}
@filled_groups[num] = group
end

def concat_and_merge_groups(other)
temp = self + other
groups = {}
groups.merge!(self.filled_groups) if self.filled_groups
groups.merge!(other.filled_groups) if other.filled_groups
temp.filled_groups = groups
temp
end

end

# ...

The combine() method just turns an Array of group result()s into the actual combinations of Strings. So given [%w[a b], %w[c d]] it outputs %w[ac ad bc bd].

The String additions here that are tracking groups were added to support backreferences. This was a challenging feature to support and it required hooking into the code at many levels. I don't want to focus on it too much though since it wasn't a critical part of the quiz so much as an impressive extra Jesús managed to include.

The next method we need to look at is the parser. I'm going to trim some of it because it's quite long, but you should still get to see how it's put together:

ruby
# ...

class Regexp
attr_reader :num_groups
def parse(s, i = 0)
repeaters = []
group = nil
while i < s.length
char = s[i].chr
case char
# ...
when '.'
group = Dot.new
when '|'
groups, i = parse(s, i + 1)
group = OrGroup.new(repeaters, groups)
return [group], i
# ...
else
group = SingleChar.new(char)
end

repeater = nil
i += 1
if i < s.length
case s[i].chr
when '*'
repeater = StarRepeater.new(group)
# ...
else
repeater = OneTimeRepeater.new(group)
i -= 1
end
i += 1
else
repeater = OneTimeRepeater.new(group)
end
repeaters << repeater
end
return repeaters, i
end

# ...

As you can see, this is a recursive character by character parser. It reads through the expression finding groups and wrapping those in repeaters, with whatever level of nesting is required. This builds up an Abstract Syntax Tree for the expression.

Let's see how this gets put to use to create the quiz solution method:

ruby
# ...

def generate
@num_groups = 0
r = self.inspect[1..-2]
repeaters, _ = self.parse(r)
strings = repeaters.map {|x| x.result}
s = combine(strings)
# Makes a pass for the backreferences
s.each do |string|
string.gsub!(/__(\d+)__/) do |match|
string.filled_groups[$1.to_i]
end
end
s
end
end

The process here is very simple: pull the source of the expression, parse it into the AST, generate the result() of the AST, and combine() those Strings into the needed Array of matches. Again, the rest of the code here is used to substitute backreferences back into the end results.

Jesús also included another method that pretty-printed and verified results, but I won't go into that here.

As usual I don't want you to miss out on the other solutions. James Koppel used currying to build up an AST of functions. Vasil Vangelovski sent in a slow but unique approach that works on all regular expressions without even parsing them. Do take the time to inspect the solutions. It's worth it.

My thanks to all who poured so much effort into this quiz. You are all fearless.

Tomorrow we will take a stab at slicing up time...