Indexing Schemaless Documents in Mongo

JUN 18, 2012

One of the best parts about mongo is its entirely schemaless nature. We can store any kind of data we’d like without having to worry about where that data comes from or how it should be laid out.

However, it’s sometimes tricky to index that data in a way that will be performant using mongo’s schema-less nature. For data of any reasonable size, querying must hit an index to narrow the search space and to avoid numerous disk seeks if the full dataset exceeds the available RAM.

Ideally our users can upload unforseen types of data to our system, and we can match elements quickly based upon a single index. The primary use case for this sort of thing is if we want to associate arbitrary key/value data with various documents. We then want to query on those keys, values, or both! I’d like to outline a few different approaches for how mongo can be made to query these sorts of user-specified key/value pairs efficiently.

Embedded Documents

The naive approach is to structure the properties we are interested in as fields stored in an embedded document. It’s a simple approach, and one which models our data well.

Suppose we are building a music service and want to build metadata about songs, and then embed those in the song object. Users can upload any sort of metadata about a song, though the application might not know what sorts of information will be stored ahead of time. Our song collection might then contain documents which look something like the following:

{
    title : "Don't stop me now",
    artist : "Queen",
    metadata : {
        genre : "Rock",
        duration : 120,
        bps : 120,
        key : "A",
    }
}

In this case we’ll ensure the following indexes:

'metadata.genre'
'metadata.duration'
'metadata.bps'
'metadata.key'

These could be set as sparse indexes, so that documents without these particular fields in their metadata are not included in the index, further narrowing our search space.

However, the major problem with this is that as users add new types of fields to the metadata, all songs must be traversed for each new index. Since mongo has no idea whether the current documents contain the newly indexed field, it has to look through the entire collection. We have to then be careful about what sorts of keys users can upload, or expect to have long periods of time where new indexes are being added.

Pros:

normal schema
simplicity in update operations

Cons:

need many indexes
indexes are built at runtime (each could build in lots of time)

Simple Multikeying

Fortunately for us, we can take advantage of a mongo feature known as multikeying. Essentially it groups elements in an array as part of the same field, allowing us to query for a document containing any one of these elements in a single query. More importantly, it gives us the ability to index only a single field and be able to query it directly by index!

What’s an example of how this works? The textbook case for multikeying is tagging. If we have a blog post document, the tags can be text fields stored as part of an array for the document:

{
    title : "My first post",
    text  : "Some kind of text",
    tags  : [ "mongodb", "segment.io", "coding" ]
}

It’s then very easy to find all blog posts matching a particular tag. Mongo will attempt to match a multikeyed field by each element in the array:

db.posts.find({ tags: "segment.io" }); // finds all posts tagged segment.io

So, by using the simple mutikeying approach - what do our documents look like now? All of our tags can be stored in the same field, but as an array rather than an embedded document.

{
    title : "Don't stop me now",
    artist : "Queen",
    metadata : [
        { genre : "Rock" },
        { duration : 120 },
        { bps : 120 },
        { key : "A" }
    ]
}

The main advantage here is that we only need a single index now, and there’s no need to index at runtime. Since we can ensure an index before our application begins to store data, mongo can build it over time without needing to do a costly index building operation all at once. Our single index is then { metadata : true }.

But wait - there’s a slight problem with this approach! When examining how our queries are performed, there is a difference between exact queries on multikeyed data vs queries which should match a particular document. Here’s a quick example which should demonstrate the point:

// Direct match, find songs of duration : 120
db.songs.find({ metadata: { duration: 120 } }).explain();
// cursor : "BtreeCursor metadata_1"

// Range match, find songs of duration greater than 120
db.songs
  .find({ metadata: { $elemMatch: { duration: { $gt: 120 } } } })
  .explain();
// cursor : "BasicCursor"

What’s going on here?

Mongo stores its indexes as part of a BTree, where the the values of particular indexed keys are stored as nodes in the tree. However the documents themselves are referenced by leaves of the BTree underneath these value nodes. This means our index can do a decent job of matching directly upon document values in our multikey, but when we try to match for a part or condition of the document, we’re out of luck. Anything but a direct match on an element causes us to revert to the BasicCursor, which scans all of the documents in a collection. To briefly recap:

Pros:

single index for multikey query
indexes for exact query

Cons:

updating is a bit more difficult, but can be done using positional operator
doesn’t work with arbitrary query matching using $elemMatch

Structured Multikeying

How do we deal with this problem? Through a more structured use of multikeys.

We want to have only a single index, but want that index to be powerful enough to match on its fields, no matter how those fields are represented. Ideally we can perform any sort of query on these values and still have it hit the index that we created. Obviously some kinds of queries will not be able to use the index, but hopefully any query which only goes a single level deep in its matching will.

The key to getting around this is to index is to combine the two concepts we’ve already seen, while also throwing in the idea of compound indexes. Compound indexes allow fields which might be queried together to use the same index, as long as both fields are specified in the query.

Using multikeys and compound indexes we can create an array of items which will have keys with a known name but with a variable number of items. To do this, we structure our data as an array of key, value pairs - but now with explicitly named fields for the items in the array.

Now our document structure looks more like this:

{
    title : "Don't stop me now",
    artist : "Queen",
    metadata : [
        { key : "genre", value : "Rock" },
        { key : "duration", value : 120 },
        { key : "bps", value : 120 },
        { key : "key", value : "A" }
    ]
}

Our final index is { 'metadata.key' : true, 'metadata.value' :true }. In the end we only need a single index to do all of the matching queries that we’d like!

Pros:

querying arbitrarily using $elemMatch
single index is needed

Cons:

more data necessary for same representation
updating is more difficult

The obvious downside to this approach is that now our documents look much more complicated and far less straightforward than the simple schema that we started with. Additionally, storing the documents requires more space since the key and valuefield names are repeated with each document.

It isn’t perfect, but it does get us what we wanted: a single index which supports different types of matching queries.

Like what you’ve read? Follow me on twitter or check out what I’m building at Segment.io.

Acknowledgements: Parts of this post were inspired by the MongoDB schema decisions at Chemeo.