Index and Query (#54)

by Lyndon Samson

All this fiddling with bits in the thread "How to get non-unique elements from an array" got me digressing to search engines and indexing.

So if you have for example:

Doc1=The quick brown fox

Doc2=Jumped over the brown dog

Doc3=Cut him to the quick

You can build a table with bit number and word.

1 the
2 quick
3 brown
4 fox
5 jumped
6 over
7 dog
8 cut
9 him
10 to
11 quick

To create indices:

Doc1=00000001111
Doc2=00001110101
Doc3=11110000011

You can very quickly return the Docs that contain 'the' [ Doc1,Doc2,Doc3 ], or brown [ Doc1,Doc2 ] etc.

This week's Ruby Quiz is to write a simple indexer/query system.

[ Note:

In the spirit of that thread, I think part of the quiz should be to solve the indexing problem in the shortest, most elegant, yet fastest way possible. Maybe that goes without saying, but I've seen some pretty long quiz solutions in the past.

--Ryan Leavengood ]


Quiz Summary

This was a fun quiz for me because I really don't know anything about indexing documents. I learned a ton just by reading the varied solutions.

Before we get into the bit magic, check out this simple Hash style solution by Dale Martenson:

ruby
class IndexHash
def initialize( documents=nil )
@index = Hash.new( [] )
input( documents ) if documents
end

def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end

def insert( document_symbol, word )
w = word.downcase
unless @index[w].include?( document_symbol )
@index[w] += [ document_symbol ]
end
end

def find( *strings )
result = []
strings.each do |string|
string.split.each do |word|
result += @index[ word.downcase ]
end
end
result.uniq
end

def words
@index.keys.sort
end
end

We see in initialize() that the index is just a Hash. The input() method is also easy to understand, as it just adds each word of a document to the Hash, as an Array of symbolic document names where the word can be found (see insert()).

The other side of the equation is find(). It takes an Array of Strings, dices those up into words, combs the index for each word, and adds all the connected document symbols to the result Array it returns.

David Balmain has done some timing of the submitted solutions:

Benchmarks

This isn't the fastest solution for searches, but it still helped me understand what we are shooting for here.

Now let's take it a step further and look at Bob Showalter's bit solution:

ruby
#!/usr/local/bin/ruby

# document indexing/searching class
class Index

# default index file name
INDEX_FILE = 'index.dat'

# loads existing index file, if any
def initialize(index_file = INDEX_FILE)
@terms = {}
@index = {}
@index_file = index_file
if File.exists? @index_file
@terms, @index = Marshal.load(
File.open(@index_file, 'rb') {|f| f.read})
end
end

# ...

We immediately see that this code has two Hashes, one for the terms and one for the index. We can also see that Marshal is used to save and load these Hashes.

Now let's look at the methods that add to the index:

ruby
# ...

# sets the current document being indexed
def document=(name)
@document = name
end

# adds given term to the index under the current document
def <<(term)
raise "No document defined" unless defined? @document
unless @terms.include? term
@terms[term] = @terms.length
end
i = @terms[term]
@index[@document] ||= 0
@index[@document] |= 1 << i
end

# ...

The first method just sets the name of the document we are currently dealing with. The second adds a single term to the index, under that document name.

The first step in adding a term is to place it in the terms Hash. The key is the term and the value is the number of pairs already in the Hash (basically a numerical index). Why didn't Bob just use an Array here? Because it would slow down lookups. You would have to walk the Array to find the term in question each time you needed to know its index.

Once you have an index for the new term, it's time to record it in the index Hash, under the current document name. Each document name is given a single Integer for a value. The bit at the term index is then just flipped on to indicate the presence of a term. This will make for some big numbers, but remember that Ruby will automatically switch to Bignum as needed.

Now we need the tools to get the info back out:

ruby
# ...

# finds documents containing all of the specified terms.
# if a block is given, each document is supplied to the
# block, and nil is returned. Otherwise, an array of
# documents is returned.
def find(*terms)
results = []
@index.each do |document, mask|
if terms.all? { |term| @terms[term] && mask[@terms[term]] != 0 }
block_given? ? yield(document) : results << document
end
end
block_given? ? nil : results
end

# dumps the entire index, showing each term and the documents
# containing that term
def dump
@terms.sort.each do |term, value|
puts "#{term}:"
@index.sort.each do |document, mask|
puts " #{document}" if mask[@terms[term]] != 0
end
end
end

# saves the index data to disk
def save
File.open(@index_file, 'wb') do |f|
Marshal.dump([@terms, @index], f)
end
end

end

# ...

Again, find() is our primary search method. It walks the document listing, checking for any document containing all the terms. (That's different from the first solution we looked at which found documents containing any terms.) A term is found, simply by checking to see if a given bit is on. Ruby makes this easy since the indexing method, [](), returns bits for Integers. If you provided a block to find(), each document is passed when found. Otherwise, find() collects and returns an Array of the documents.

Both dump() and save() are obvious and do exactly what the comments say they do.

Here's the last bit of code:

ruby
# ...
if $0 == __FILE__
idx = Index.new
case ARGV.shift
when 'add'
ARGV.each do |fname|
idx.document = fname
IO.foreach(fname) do |line|
line.downcase.scan(/\w+/) { |term| idx << term }
end
end
idx.save
when 'find'
idx.find(*ARGV.collect { |s| s.downcase }) do |document|
puts document
end
when 'dump'
idx.dump
else
print <<-EOS
Usage: #$0 add file [file...] Adds files to index
#$0 find term [term...] Lists files containing all term(s)
#$0 dump Dumps raw index data
EOS

end
end

That's a clever interface in very little code. It reads the first argument to see if you want to "add", "find", or "dump" (similar to svn, cvs, or gem). The rest of the arguments are the files to "add" or the terms to "find".

The other interesting element at David Balmain's result comparison page, is the chart of capabilities. Think about how you might support queries like "Apple AND NOT fruit". See David's Simple Ferret solution for a start on this kind of logic.

Then think about how you might add the ability to search for phrases instead of terms, like "Ruby Programming Language". This problem space is vast and interesting to explore, I think.

My thanks to everyone who worked the quiz and to David Balmain for results that helped me know what to even look at.

Tomorrow we will explore James's favorite "trick taking game"...