Parsing JSON (#155)

There has been a lot of talk recently about parsing with Ruby. We're seeing some parser generator libraries pop up that make the task that much easier and they've been stirring up interest.

In honor of that, this week's Ruby Quiz is to write a parser for JSON.

JSON turns out to turns out to be a great little example for writing parsers for two reasons. First, it's pretty easy stuff. You can hand-roll a JSON parser in under 100 lines of Ruby. The second advantage is that the data format is wonderfully documented:

JSON

Since JSON is just a data format and Ruby supports all of the data types, I vote we just use Ruby itself as the abstract syntax tree produced by the parse.

Feel free to show off your favorite parser generator, if you don't want to roll your own. Anything goes.

Here are a few tests to get you started:

ruby
require "test/unit"

class TestJSONParser < Test::Unit::TestCase
def setup
@parser = JSONParser.new
end

def test_keyword_parsing
assert_equal(true, @parser.parse("true"))
assert_equal(false, @parser.parse("false"))
assert_equal(nil, @parser.parse("null"))
end

def test_number_parsing
assert_equal(42, @parser.parse("42"))
assert_equal(-13, @parser.parse("-13"))
assert_equal(3.1415, @parser.parse("3.1415"))
assert_equal(-0.01, @parser.parse("-0.01"))

assert_equal(0.2e1, @parser.parse("0.2e1"))
assert_equal(0.2e+1, @parser.parse("0.2e+1"))
assert_equal(0.2e-1, @parser.parse("0.2e-1"))
assert_equal(0.2E1, @parser.parse("0.2e1"))
end

def test_string_parsing
assert_equal(String.new, @parser.parse(%Q{""}))
assert_equal("JSON", @parser.parse(%Q{"JSON"}))

assert_equal( %Q{nested "quotes"},
@parser.parse('"nested \"quotes\""') )
assert_equal("\n", @parser.parse(%Q{"\\n"}))
assert_equal( "a",
@parser.parse(%Q{"\\u#{"%04X" % ?a}"}) )
end

def test_array_parsing
assert_equal(Array.new, @parser.parse(%Q{[]}))
assert_equal( ["JSON", 3.1415, true],
@parser.parse(%Q{["JSON", 3.1415, true]}) )
assert_equal([1, [2, [3]]], @parser.parse(%Q{[1, [2, [3]]]}))
end

def test_object_parsing
assert_equal(Hash.new, @parser.parse(%Q{{}}))
assert_equal( {"JSON" => 3.1415, "data" => true},
@parser.parse(%Q{{"JSON": 3.1415, "data": true}}) )
assert_equal( { "Array" => [1, 2, 3],
"Object" => {"nested" => "objects"} },
@parser.parse(<<-END_OBJECT) )
{"Array": [1, 2, 3], "Object": {"nested": "objects"}}
END_OBJECT

end

def test_parse_errors
assert_raise(RuntimeError) { @parser.parse("{") }
assert_raise(RuntimeError) { @parser.parse(%q{{"key": true false}}) }

assert_raise(RuntimeError) { @parser.parse("[") }
assert_raise(RuntimeError) { @parser.parse("[1,,2]") }

assert_raise(RuntimeError) { @parser.parse(%Q{"}) }
assert_raise(RuntimeError) { @parser.parse(%Q{"\\i"}) }

assert_raise(RuntimeError) { @parser.parse("$1,000") }
assert_raise(RuntimeError) { @parser.parse("1_000") }
assert_raise(RuntimeError) { @parser.parse("1K") }

assert_raise(RuntimeError) { @parser.parse("unknown") }
end
end

Quiz Summary

We saw a large variety of solutions for this week's problem. Many of them used a parser generator to construct their parser. You do that by defining a grammar that describes the syntax you need to read. The parser generator then translates your grammar into parsing code that will match the described syntax. Of the generators used, Treetop was definitely the most popular and is surely worth a look if you want to do some grammar based parsing.

I'm not going to show a grammar based solution here, not because I don't like them, but because I want to show a few of the ideas behind simple parsing. To do that, we will need to examine a hand-rolled solution. Just to be clear though, using grammar based parsers can often be a more robust choice if you can find an official grammar for the content you need to parse.

Eric Mahurin sent in a hand-rolled solution that has quite a few advantages. First, it is trivial to adapt so that the entire content to be parsed doesn't need to be read into memory. It's also some very efficient code. Unfortunately, that makes it a touch less obvious if you aren't already familiar with parsing techniques.

Given that, I'm going to show my own hand-rolled recursive descent parser. It's not as cool as Eric's but it's a little easier to take in. We will say it's a good introduction to Eric's code, which you should be able to follow after I break this down.

Here's the beginning of my parser:

ruby
#!/usr/bin/env ruby -wKU

require "strscan"

class JSONParser
AST = Struct.new(:value)

def parse(input)
@input = StringScanner.new(input)
parse_value.value
ensure
@input.eos? or error("Unexpected data")
end

private

# ...

One of the first concerns when parsing is the need to manage where you currently are in the input. If you treat the input as an IO object, you can read input piece by piece and the IO object itself naturally keeps track of where you are. For String input though, it's often easier to use Ruby's standard StringScanner class. It wraps the input and allows you to test regular expression matches against the head of that input. It will tell you when they match or don't and when they do, your position automatically advances forward beyond that match. You can see that I set this up in the code above.

The only public method for my class is parse(). It prepares the StringScanner as I just described, tries to parse a JSON value out of the input, then makes sure we consumed all of the input. Note that my use of ensure here isn't very standard. I'm just using it to run some code at the end of the method without changing the return value of the method.

The AST definition also merits a bit of discussion. It would have been nice to just have each method return the Ruby objects for the JSON it parsed. However, false and nil (null in JSON) are legal JSON parses in some places. If I return those, I won't be able to tell if my parse succeeded or failed. To get around that, all parsed JSON values are wrapped in a trivial abstract syntax tree object. Returning this object is always true and, after I've verified that the parse worked, it's just one more method call to retrieve the actual value it parsed.

It's worth mentioning that this parser is based on the not quite correct definition of JSON I put forth in the quiz tests. Only objects and arrays are allowed to be top-level JSON values. An easy fix is to replace this line

ruby
# ...

parse_value.value

# ...

with:

ruby
# ...

if top_level = parse_object || parse_array
top_level.value
else
error("Illegal top-level JSON object")
end

# ...

Now let's look at the main parser:

ruby
# ...

def parse_value
trim_space
parse_object or
parse_array or
parse_string or
parse_number or
parse_keyword or
error("Illegal JSON value")
ensure
trim_space
end

# ...

This method really sums up the basic strategy of recursive descent parsing. At each point, try to read out one of the legal values that can occur there. You can do that just by drilling down into more specialized methods that know how to read that one thing. If at any time you can't read a legal value, you have an error.

Let's dig into the specialized parsers a bit more to see how this works:

ruby
# ...

def parse_object
if @input.scan(/\{\s*/)
object = Hash.new
more_pairs = false
while key = parse_string
@input.scan(/\s*:\s*/) or error("Expecting object separator")
object[key.value] = parse_value.value
more_pairs = @input.scan(/\s*,\s*/) or break
end
error("Missing object pair") if more_pairs
@input.scan(/\s*\}/) or error("Unclosed object")
AST.new(object)
else
false
end
end

# ...

This code reads JSON objects. It's pretty linear, so let's digest it in order. First, we have to have an opening brace or we don't have an object at all. We can see here that I try a regular expression on the StringScanner to see if that is indeed what's up next. If it is scan() will return true and @input will advance past that brace for our future matches. If it's false, we can't read an object and the whole attempt is a bust.

When we know we're inside an object, we create the Ruby equivalent (a Hash), fill it with all of the string/value pairs we can read, then make sure we find a closing brace. Reading the pairs is the trickiest part because we have to match a string, followed by a colon, and finally a value. Then, if we find a comma, we know another pair is expected. If not, we've read the whole object. Note that I verify these assumptions at every step and toss error()s if any of them fail. For parsing strings and values, we just reuse the parse_string() method we first saw called in parse_value() and parse_value() itself.

You can see that I'm constantly trimming space around the JSON syntax. I could have also done that with repeated calls to the trim_space() helper we saw used in parse_value(), but that fattens up the code a bit and slows things down with more tests. For these simple cases, I opted for the shortcut.

Having deciphered parse_object(), parse_array() is trivial:

ruby
# ...

def parse_array
if @input.scan(/\[\s*/)
array = Array.new
more_values = false
while contents = parse_value rescue nil
array << contents.value
more_values = @input.scan(/\s*,\s*/) or break
end
error("Missing value") if more_values
@input.scan(/\s*\]/) or error("Unclosed array")
AST.new(array)
else
false
end
end

# ...

This is identical to the process we just examined save that it's pulling individual values in the middle instead of string/value pairs. This simplifies the code a bit, as you can see. We also throw these objects into a Ruby Array instead of a Hash.

Now we are ready for parse_string() and it has a couple of helpers:

ruby
# ...

def parse_string
if @input.scan(/"/)
string = String.new
while contents = parse_string_content || parse_string_escape
string << contents.value
end
@input.scan(/"/) or error("Unclosed string")
AST.new(string)
else
false
end
end

def parse_string_content
@input.scan(/[^\\"]+/) and AST.new(@input.matched)
end

def parse_string_escape
if @input.scan(%r{\\["\\/]})
AST.new(@input.matched[-1])
elsif @input.scan(/\\[bfnrt]/)
AST.new(eval(%Q{"#{@input.matched}"}))
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
else
false
end
end

# ...

Whenever a structure you need to read gets more complicated, you want to break it down into smaller parsers that read more specialized pieces. Some probably would have broken down the the string/value pairs from object (into a parse_object_pair()), but you don't gain much for that and it was just simple enough that I opted for the easier approach. Here though we need to handle content and escapes differently and there may be any combination of them in any order inside the string. Now the gain is worth it.

Content is easy enough to handle, since we can pass it through unaltered. It's already content in a Ruby String object. Escapes we have to work on a little more. Some we just unescape, but others need to be converted. I used pack() to handle Unicode, but you can see that I was lazy and used eval() on the special string escapes. All of these have the same meaning in Ruby and thanks to the match I know it's safe to eval() the contents without worrying about embedded Ruby nastiness.

With those defined, parse_string() is similar to parse_array(). Find the start of a JSON string, pull content and escapes as long as we can, then find the end of the string.

The last two parsers are the easiest of all:

ruby
# ...

def parse_number
@input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
AST.new(eval(@input.matched))
end

def parse_keyword
@input.scan(/\b(?:true|false|null)\b/) and
AST.new(eval(@input.matched.sub("null", "nil")))
end

# ...

These are just match and eval() as you can plainly see. Again the evals() are safe because the match ensures we aren't facing any dangerous content.

Some feel that using regular expressions like this isn't true parsing. We certainly could have chopped the number rule down into a bunch of smaller rules. However, the number definition is squarely in the domain of what regular expressions do well and I'm more of a practical kind of guy. I have access to regular expressions with this setup, the needed expression isn't really all that complex, and I like easy jobs. Thus I use them.

All that is left are the two helpers I used, though the implementations won't be any surprise:

ruby
# ...

def trim_space
@input.scan(/\s+/)
end

def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end

First, trim_space() can just try a match to advance us pass any whitespace. It may fail, because there wasn't any whitespace to skip, but that doesn't affect us any. We know that we aren't about to read whitespace after it is called, either way.

My error() wrapper just raise()s exceptions. It adds the content left to parse so you can see where you had trouble or replaces the message altogether to warn you that all of the content was exhausted.

That's all it takes to build a simple JSON parser. Take some time to go look through the other hand-rolled solutions now and I bet you'll be surprised by how similar they work. Then you can look into grammars and how they simplify the process of defining new grammars.

The final Ruby Quiz will take us back into the world of finance...