As with most companies that approach the 100 million-row count in MySQL, we’re starting to hit growing pains. We’re also a bit abnormal in our DB needs – our load is about 45% write. This is rather unfortunate for us – read replicas are way less useful for our particular needs, especially as you still have to have a single write-master. We realize there are master-master setups, but the complications are too much to deal with.
So we’ve been adding RAM as our indexes grow (currently around 23.5G and growing), but vertical scaling forever is obviously a non-starter. We’re also planning for a huge bump in data when we release to Android, so we started looking around at other solutions.
This conversation doesn’t go far before we start talking about NoSQL, but we don’t care about the “SQL-iness” of a solution. What we care about are:
- Horizontal scalability
- Keysets larger than RAM not killing the DB
- Ease of scaling (add a node, get performance)
- Ease of management (small ops team)
- Runs well on AWS (EC2)
- Great community and support
We spoke with our good friends at Wildbit (who, it so happens, power all of our transactional emails in Postmark and host all of our code in Beanstalk) about their scaling problems. They already went through this exercise (and continue to do so as they host an unbelievable amount of data) and were able to point us around a bit.
While PostgreSQL was on our radar for a while, the management of highly-scalable relational databases sent our heads spinning. Right now, our ops team is very, very small. So, we started looking at alternative solutions. Here’s what we came up with:
CouchDB and MongoDB
You can’t have an intelligent conversation if you aren’t at least slightly versed on CouchDB and MongoDB. Things they have in common are:
- Document store with “Collections” (think tablespace)
- Relatively easy to code for
- Great libraries
- Great community
- Keyset must be in RAM
- Swapping can bring down the DB instantly
- Adding capacity requires understanding sharding, config servers, query routers, etc.
This is hardly a comprehensive list, but there were a couple non-starters for us.
CouchDB’s data files require compaction, so that’s no good (an ops problem), and compaction has been known to fail silently, which is even worse for non-experts.
We talked with several companies we’re friendly with, and MongoDB became less and less favorable just in the ops story. While each company has its own story to tell, none were particularly good.
But the biggest roadblock for us is the keyset must remain in RAM. After hearing horror stories of what happens when keysets swapped to disk (even SSDs, although it’s a much better story with SSDs) we moved on. We can’t afford EC2 instances with dedicated SSD drives, and need our DB to keep serving when we reach capacity or are adding a new node to the cluster.
(I should mention Wildbit uses CouchDB for some things, and are quite happy with it, but they have a much more significant ops team. Read about their decision to move off of MongoDB and on to CouchDB with the help of Cloudant.)
UPDATE: Originally, this write-up stated that compaction could cause data loss. Adam Kocoloski from Cloudant took a bit of issue with this in the comments below. To his credit, while you can lose earlier versions of documents on compaction, as long as you’re doing manual conflict resolution, it shouldn’t be an issue. He also notes that CouchDB has gotten better at graceful degradation of keyspace overrunning RAM capacity.
Casandra and Hadoop/HBase
Both of these databases crossed our plates as well because they meet a lot of the things we want… They scale to incredible sizes, deal with huge datasets easily, etc. But the more research we did, the more we understood that Casandra would require a hefty amount of ops. Adding a node to the cluster is not a simple task, and neither is getting a cluster set up properly. While it has absolutely brilliant parts, it also has fear-inducing parts.
Hadoop came close, but also had heavy requirements when it came to understanding the underlying architecture, and is really built to deal with datasets in the terrabyte-petabyte range – and that’s not us.
We also looked at:
- PostgreSQL - it has a nice new key-value column format called hstore, and the idea of getting the best of both worlds – relational and KV in the same solution – was really tempting. But the ops requirements are not trivial, and it runs in to many of the same scaling issues that are making us hop off of MySQL
- Neo4j – not entirely proven, and the graph features are not something we’d use heavily
- And about a bajillion others.
There is a seemingly obvious path with DynamoDB here. We’re pretty tightly coupled in to AWS’s infrastructure (because we like to be), and DynamoDB would seem to make sense – in fact, it was in the lead for most of the time. However, there are downsides.
Werner Vogel’s original post and the Dynamo Paper about DynamoDB stirred things up a lot. It’s integral to running Amazon.com, and many of its ideas make incredible amounts of sense. Huge pros came out:
- “Unlimited” scale
- Highly distributed
- Managed – hides almost all ops, reducing load on our team
- Highly available
- Cost efficient
These are some tasty treats. But the pricing is difficult to understand, and more difficult to plan for. Each “write” operation only accepts 1k of data, and our game save state is often over that – meaning each game update can require 4 “write” units or more. In order to not have DynamoDB spit errors at you, you have to “provision” a certain amount of reads/writes per second, so we’ll always be paying for the worst case scenario, and have the potential to get screwed if we ever get, say, featured by Apple on the iOS store. Even worse – scaling up your capacity can take hours, during which old capacity limits will be in place.
Further limitations include:
- “Secondary indexes” cost more write units per save, meaning even small writes that have secondary indexes can quickly clog up your provisioned capacity
- No MapReduce
- No searching – you can set up CloudSearch, but this is additional cost, and its up to you to manually submit updates from DynamoDB to CloudSearch
- No triggers
- No links between data items
- REALLY difficult to “take your data with you”
- No counters (we do need to be able to replace “select count(*) from…” queries)
- No TTL for entries
Finally, Riak (by Basho)
Despite all the positives of DynamoDB, the closed-source, proprietary nature of it, along with severely slow scaling limitations, and a lack of decent querying capabilities (without involving even more AWS offerings) made us take a solid look at Riak.
Riak ticked a lot of boxes. First, its philosophy is very closely tied to many of DynamoDB concepts we like. (Here’s their reprint of the Dynamo paper, with annotations. Not a PDF, even!) And then there are the easy ones:
- Open source (Apache license)
- Multiple backends (including a memory-only backend that we may even consider replacing memcached/Redis with)
- Great community support
- A plethora of libraries
But how about the big one? Indexes in RAM? Riak’s default backend – Bitcask – does require that the entire keyset fit in RAM. It’s the fastest of the backends, obviously, but doesn’t have any support for secondary indexes. So for two reasons, this won’t fit our needs.
However, we can use Google’s LevelDB (Riak LevelDB documentation here). It has secondary index support, and does not require that all our indexes fit in RAM. This is big for us, no, huge for us, because a lot of data becomes historical as time goes on. It’s important that indexes that are old (Least Recently Used, or “LRU”) can pass out of RAM. It’s not that we don’t want to have to keep machines upgraded, but we can’t have the DB come to a screeching halt if we page.
Yes, Riak is an eventual-consistency store, but most document or KV store databases are in a distributed environment.
Then there are these features, which add on to the niceties of Dynamo:
- Map Reduce (albeit slow as heck, it’s available for data analysis)
- Pre- and post-commit triggers for doing… stuff.
- Link following
- TTLs on data items
- Buckets, which are kinda like namespaces (and not like tables)
- The ability to use multiple backends at one time – which means you can tailor buckets by need.
- We can easily “take our data with us” if we ever move to bare metal
Minus secondary indexes, sometimes you want to QUERY for something. Riak has Riak Search, but enough about that. Let’s talk about Yokozuna – their codename for direct Solr integration. Unlike CloudSearch on top of DynamoDB, Solr is directly integrated in to the core workings of Riak. You create an index in Solr, associate it with a bucket, create the schema you want (or don’t, the defaults will pick out values dynamically) and it adds a
post-commit ”magic” trigger to a bucket that will instantly index an item when it is written or updated.
This way a little bit of brilliance lies. First, as Riak is a distributed database, Distributed Solr is used for the Solr integration. Secondly, by building a decent Solr schema, not only can you do away with many of your secondary-index needs, but in many instances, get the results you need from Solr directly, and never have to query Riak.
Wildbit does this a lot. While they started with a MongoDB/Elastic Search setup, their eventual move to CouchDB was in part because it fit so nicely in to their already-running Elastic Search solution. With Riak, we’re getting something very similar, and minus initial setup, takes almost no management to keep up to date. (Just in case something does get behind, Riak has a neat little feature called Active Anti-Entropy which will take care of it.)
Next we have ease of cluster management. Minus the cluster admin console which makes things even easier, cluster management is about as simple as it gets. Adding a node to a cluster consists of about two command-line steps. This means node management is insanely easy for us ops-short teams.
Riak’s architecture also allows for rolling server upgrades, rolling software updates, node replacement, and so on. Add on a few monitoring scripts (we’ll be using Hosted Graphite) and we’re good to go. (Fun bonus – Hosted Graphite uses Riak on the backend.)
Nothing is ever as easy as it seems, and Riak will obviously present roadblocks we haven’t prepared for. But the official IRC channel (#riak on Freenode) is almost always staffed with Basho employees, they reply to email quite quickly, and most interesting to us is that they have “indie” pricing for their enterprise services, which we’ll likely avail ourselves of in the future.
Over time, we’ll let you know how our migration is going, and maybe this post will help you when you’re ready to move from MySQL to a distributed, horizontally scalable database solution.