Our engine recently ran into an out-of-memory crash when parsing a 150 MB JSON file (foiled by the artists again). The file caused our parser to create a DOM-like tree representation that was over 2 GB in size. No worries, I put the DOM tree on a strict memory diet and reduced the size to 120 MB — slightly less than the size of the original JSON data.

But it made me think of something I’ve always wanted to try: in-place parsing.

In-place parsing

The idea behind in-place parsing is simple. Instead of copying the entire JSON document into a DOM tree representation, we let the data in the JSON file represent itself. We represent an object simply by a pointer to the particular position in the JSON data where that object starts. And then we lazily parse out the properties as we need them.

Let’s look at an example:

     "people" : [
        {"name" : "Niklas", "occupation" : "programmer"},
        {"name" : "Obama", "occupation" : "president"}
     ]

If I wanted to represent the first person in this list, I would use a pointer into the JSON string where that object starts:

{"name" : "Niklas", "occupation" : "programmer"},  {"name" : "Obama", ...

Similarly, if I wanted to talk about my name I would use a pointer to where that string starts:

"Niklas", "occupation" : "programmer"},  {"name" : "Obama", "occupatio...

An in-place parsing API provides functions for parsing values, given pointers to the values’ locations in the JSON document. I use something like this:

  ValueType type(const char *s);
  bool to_bool(const char *s);
  int to_int(const char *s);
  float to_float(const char *s);
  void to_string(const char *s, Vector<char> &str);
  void to_array(const char *s, Vector<const char *> &array);
  void to_object(const char *s, Map<DynamicString, const char *> &object);

As you can see from this interface, when parsing a JSON array or object I do not recursively parse all sub-objects (doing that would be the same as building a DOM tree). Instead I represent the object as a map from a string (the property name) to the location in the JSON file where the value of that property is found.

When I need to recurse into the sub-objects I do further parsing of their const char * pointers.
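To make the laziness concrete, here is a minimal sketch of what `to_object` could look like. The engine's `Map` and `DynamicString` containers aren't shown in the post, so this substitutes `std::map` and `std::string`; the key point is the `skip_value` helper, which advances past a value without interpreting it, and that only value *locations* are stored. (Escape handling in keys is omitted for brevity; this is an illustration, not the actual implementation.)

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

static const char *skip_ws(const char *s) {
    while (*s && isspace((unsigned char)*s)) ++s;
    return s;
}

// Advance past one complete JSON value without interpreting it.
static const char *skip_value(const char *s) {
    s = skip_ws(s);
    if (*s == '"') {
        ++s;
        while (*s && *s != '"') {
            if (*s == '\\') ++s;    // skip escaped character
            ++s;
        }
        return s + 1;
    }
    if (*s == '{' || *s == '[') {
        const char open = *s;
        const char close = (open == '{') ? '}' : ']';
        int depth = 0;
        do {
            if (*s == '"') { s = skip_value(s); continue; }
            if (*s == open) ++depth;
            else if (*s == close) --depth;
            ++s;
        } while (*s && depth > 0);
        return s;
    }
    // number, true, false, null
    while (*s && *s != ',' && *s != '}' && *s != ']' && !isspace((unsigned char)*s))
        ++s;
    return s;
}

// Record each property name together with the *location* of its value in
// the original buffer -- the values themselves are never parsed here.
static void to_object(const char *s, std::map<std::string, const char *> &object) {
    s = skip_ws(s);
    assert(*s == '{');
    ++s;
    for (;;) {
        s = skip_ws(s);
        if (*s == '}' || *s == '\0') return;
        assert(*s == '"');
        const char *key_begin = s + 1;
        const char *key_end = key_begin;
        while (*key_end && *key_end != '"') ++key_end;
        s = skip_ws(key_end + 1);   // now at ':'
        s = skip_ws(s + 1);         // now at the value
        object[std::string(key_begin, key_end)] = s;   // location only
        s = skip_ws(skip_value(s));
        if (*s == ',') ++s;
    }
}
```

Calling `to_object` on one of the person objects from the example yields a two-entry map whose values point straight back into the JSON string, at the `"Niklas"` and `"programmer"` literals.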

So the people array in the example above will be represented as an array of string pointers to the locations of the objects in the array. Something like this:

  [0]: '{"name" : "Niklas", "occu...'
  [1]: '{"name" : "Obama", "occup...'

The appealing thing about this approach is that it allocates almost no memory in addition to what the original JSON file uses. We only need a few bytes for the Vectors and Maps of the objects we are currently parsing.

This means that an in-place parser can handle very large files. A 2 GB JSON file is no problem: we can simply mmap() it and run the in-place parser over the mapped memory.
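The mapping step is a few lines of POSIX. Here is a sketch (function name and error handling are mine, and a production version would also want `munmap` and better error reporting):

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a JSON file read-only into memory and return a pointer to its bytes.
// Returns nullptr on failure; *size receives the file size in bytes.
const char *map_json(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return nullptr;
    }
    *size = (size_t)st.st_size;
    const char *data =
        (const char *)mmap(nullptr, *size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping remains valid after the fd is closed
    return data == MAP_FAILED ? nullptr : data;
}
```

Because the mapping is `PROT_READ`, the OS pages the file in on demand and can evict clean pages under pressure, which is exactly the behavior you want when the "DOM" is the file itself.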

Since the bulk of the data is in the immutable JSON file, in-place parsers are also easy to multi-thread. Different threads can parse different branches of the JSON file without stepping on each other’s toes.
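As a minimal illustration of that point, here two threads work on different branches of the same immutable buffer with no synchronization beyond `join()`. The `count_elements` helper is a stand-in I made up for real in-place parsing work; the design point is that read-only data needs no locks:

```cpp
#include <cassert>
#include <cstring>
#include <thread>

// Count elements in a JSON array slice [begin, end) by counting commas at
// nesting depth zero. A stand-in for real in-place parsing work.
static int count_elements(const char *begin, const char *end) {
    int depth = 0, count = 0;
    for (const char *p = begin; p != end; ++p) {
        if (*p == '[' || *p == '{') ++depth;
        else if (*p == ']' || *p == '}') --depth;
        else if (*p == ',' && depth == 0) ++count;
    }
    return count + 1;
}

// Two threads read disjoint branches of the same immutable buffer.
// Since nobody writes to the JSON data, no locks are required.
static void parse_in_parallel(const char *a_begin, const char *a_end,
                              const char *b_begin, const char *b_end,
                              int results[2]) {
    std::thread a([&] { results[0] = count_elements(a_begin, a_end); });
    std::thread b([&] { results[1] = count_elements(b_begin, b_end); });
    a.join();
    b.join();
}
```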

But what about speed? Since we avoid the busy work of allocating and shuffling nodes in a DOM tree, we might be faster than a DOM parser. On the other hand, we have to make multiple passes through the file. First to get the root object, then a second pass when we parse out its properties, etc.

Performance analysis

How bad are the multiple passes of the in-place parser? Let’s see how much memory they touch in total.

In the first pass we parse the entire file to find the keys and values of the root object. The cost of this is about O(N), where N is the size of the JSON file.

In the second pass we parse the objects at the next level of nesting to find the locations of their keys and values. The cost of this step is O(c_1 N), where c_1 is the fraction of the file covered by objects at this level (probably a value quite close to 100 %).

Continuing, we find the total parse cost to be:

O(N + c_1 N + c_2 N + c_3 N + ...)

O((1 + c_1 + c_2 + c_3 + ...) N)

O(kN), k = (1 + c_1 + c_2 + c_3 + ...)

Where k is the average “depth” or “level of nesting” of data in the file.

Let’s compare this with how much memory the DOM parser touches. The DOM parser reads the N bytes of the original file, then allocates, writes and later reads the nodes of the DOM tree, so all together it touches something like O(4N).

For a typical file that is not very deeply nested you would expect O(kN) and O(4N) to be in the same ballpark. The devil then is in the constants. Does the tree building and node shuffling of the DOM method make it more expensive, or is the extra parsing performed by the in-place method more expensive?

A test run

To test this, I ran both parsers on the 150 MB JSON file that started all this. I created a lightweight “helper interface” for the in-place parser that gave it the same interface as our DOM nodes, so that I could switch parsers without modifying the data compiler.

The result:

  Original DOM-parsing:   4420.2 ms
  In-place parsing:       5073.5 ms

Pretty much the same, in other words. Damn, that’s disappointing! So while the in-place method saves a lot of memory, that saving does not translate into an immediate speed increase. At least not after the memory optimizations I made to the DOM tree.

The test JSON file has the bulk of its data in nodes with the path:


So k ≈ 5 in this case.

From here, I could go on and try various optimizations to improve the in-place parser, but to be fair I would have to spend an equal amount of time optimizing the DOM parser, because neither has been extensively low-level optimized.

But frankly, this problem does not warrant that much attention. JSON files are never parsed by our runtime, only in our data compile step, and big files are rare, so this has no huge effect on our end users. I was mostly curious to see if moving to an in-place parser would give an order-of-magnitude improvement. Seems it doesn’t, at least not with the approach I have been using.