Taking the Riak plunge

As with most companies that approach the 100 million-row count in MySQL, we’re starting to hit growing pains. We’re also a bit abnormal in our DB needs – our load is about 45% write. This is rather unfortunate for us – read replicas are way less useful for our particular needs, especially as you still have to have a single write-master. We realize there are master-master setups, but the complications are too much to deal with.

So we’ve been adding RAM as our indexes grow (currently around 23.5G and growing), but vertical scaling forever is obviously a non-starter. We’re also planning for a huge bump in data when we release to Android, so we started looking around at other solutions.

This conversation doesn’t go far before we start talking about NoSQL, but we don’t care about the “SQL-iness” of a solution. What we care about are:

  • Horizontal scalability
  • Speed
  • Keysets larger than RAM not killing the DB
  • Ease of scaling (add a node, get performance)
  • Ease of management (small ops team)
  • Runs well on AWS (EC2)
  • Great community and support
  • Load-balanced

We spoke with our good friends at Wildbit (who, it so happens, power all of our transactional emails in Postmark and host all of our code in Beanstalk) about their scaling problems. They already went through this exercise (and continue to do so as they host an unbelievable amount of data) and were able to point us around a bit.

While PostgreSQL was on our radar for a while, the management of highly-scalable relational databases sent our heads spinning. Right now, our ops team is very, very small. So, we started looking at alternative solutions. Here’s what we came up with:

CouchDB and MongoDB

You can’t have an intelligent conversation if you aren’t at least slightly versed on CouchDB and MongoDB. Things they have in common are:

  • Document store with “Collections” (think tablespace)
  • Relatively easy to code for
  • Great libraries
  • Great community
  • Keyset must be in RAM
  • Swapping can bring down the DB instantly
  • Adding capacity requires understanding sharding, config servers, query routers, etc.

This is hardly a comprehensive list, but there were a couple non-starters for us.

CouchDB’s data files require compaction, which is an ops problem in itself, and compaction has been known to fail silently, which is even worse for non-experts.

We talked with several companies we’re friendly with, and MongoDB became less and less favorable just in the ops story. While each company has its own story to tell, none were particularly good.

But the biggest roadblock for us is that the keyset must remain in RAM. After hearing horror stories of what happens when keysets swap to disk (even on SSDs, although it’s a much better story with SSDs), we moved on. We can’t afford EC2 instances with dedicated SSD drives, and we need our DB to keep serving when we reach capacity or are adding a new node to the cluster.

(I should mention Wildbit uses CouchDB for some things, and are quite happy with it, but they have a much more significant ops team. Read about their decision to move off of MongoDB and on to CouchDB with the help of Cloudant.)

UPDATE: Originally, this write-up stated that compaction could cause data loss. Adam Kocoloski from Cloudant took a bit of issue with this in the comments below. To his credit, while you can lose earlier versions of documents on compaction, as long as you’re doing manual conflict resolution, it shouldn’t be an issue. He also notes that CouchDB has gotten better at degrading gracefully when the keyspace overruns RAM capacity.

Cassandra and Hadoop/HBase

Both of these databases crossed our path as well because they meet a lot of our requirements… They scale to incredible sizes, deal with huge datasets easily, etc. But the more research we did, the more we understood that Cassandra would require a hefty amount of ops. Adding a node to the cluster is not a simple task, and neither is getting a cluster set up properly. While it has absolutely brilliant parts, it also has fear-inducing parts.

Hadoop came close, but it also had heavy requirements when it came to understanding the underlying architecture, and it is really built to deal with datasets in the terabyte-to-petabyte range – and that’s not us.

Others

We also looked at:

  • PostgreSQL – it has a nice new key-value column type called hstore, and the idea of getting the best of both worlds – relational and KV in the same solution – was really tempting (there’s a small sketch of the idea just after this list). But the ops requirements are not trivial, and it runs into many of the same scaling issues that are making us hop off of MySQL.
  • Neo4j – not entirely proven, and the graph features are not something we’d use heavily
  • And about a bajillion others.
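
Out of curiosity about what that hybrid would have looked like, here is a minimal sketch of the hstore idea. It assumes psycopg2 and a PostgreSQL server where the hstore extension can be installed; the table and column names are made up for illustration.

    # Hypothetical example: schemaless key/value game state inside a relational row.
    import psycopg2

    conn = psycopg2.connect("dbname=game host=localhost")
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore;")
    cur.execute("CREATE TABLE IF NOT EXISTS saves (id serial PRIMARY KEY, state hstore);")

    # Store arbitrary key/value pairs alongside normal relational columns.
    cur.execute("INSERT INTO saves (state) VALUES (%s::hstore);",
                ["level=>3, score=>1200"])

    # Pull out a single key, or filter on containment, without a fixed schema.
    cur.execute("SELECT state -> 'score' FROM saves WHERE state @> 'level=>3';")
    print(cur.fetchall())

    conn.commit()
    cur.close()
    conn.close()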

DynamoDB

There is a seemingly obvious path with DynamoDB here. We’re pretty tightly coupled into AWS’s infrastructure (because we like to be), and DynamoDB would seem to make sense – in fact, it was in the lead for most of the time. However, there are downsides.

Werner Vogels’s original post about DynamoDB and the underlying Dynamo paper stirred things up a lot. Dynamo is integral to running Amazon.com, and many of its ideas make incredible amounts of sense. Huge pros came out:

  • “Unlimited” scale
  • Highly distributed
  • Managed – hides almost all ops, reducing load on our team
  • Highly available
  • Cost efficient
  • SSD-backed

These are some tasty treats. But the pricing is difficult to understand, and more difficult to plan for. Each “write” unit covers only 1KB of data, and our game save state is often over that – meaning each game update can consume 4 “write” units or more. To keep DynamoDB from spitting errors at you, you have to “provision” a certain number of reads/writes per second, so we’ll always be paying for the worst-case scenario, and we have the potential to get screwed if we ever get, say, featured by Apple on the App Store. Even worse – scaling up your capacity can take hours, during which the old capacity limits remain in place.
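
To make the provisioning math concrete, here is a back-of-the-envelope sketch in Python. The 1KB-per-write-unit rule is DynamoDB’s; the save size and peak request rate are made-up numbers for illustration.

    import math

    def write_units_per_item(item_size_bytes, unit_size_bytes=1024):
        """Write units consumed by a single standard write of one item."""
        return max(1, math.ceil(item_size_bytes / unit_size_bytes))

    save_state_bytes = 3500        # a typical game save, often well over 1KB
    peak_writes_per_second = 400   # what we would have to provision for, not the average

    units_per_save = write_units_per_item(save_state_bytes)    # -> 4
    provisioned_wcu = units_per_save * peak_writes_per_second  # -> 1600

    print(units_per_save, provisioned_wcu)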

Further limitations include:

  • “Secondary indexes” cost more write units per save, meaning even small writes that have secondary indexes can quickly clog up your provisioned capacity
  • No MapReduce
  • No searching – you can set up CloudSearch, but this is an additional cost, and it’s up to you to manually push updates from DynamoDB to CloudSearch
  • No triggers
  • No links between data items
  • Proprietary
  • REALLY difficult to “take your data with you”
  • No counters (we do need to be able to replace “select count(*) from…” queries)
  • No TTL for entries

Finally, Riak (by Basho)

Despite all the positives of DynamoDB, its closed-source, proprietary nature, its severely slow capacity scaling, and its lack of decent querying capabilities (without involving even more AWS offerings) made us take a solid look at Riak.

Riak ticked a lot of boxes. First, its philosophy is very closely tied to many of the DynamoDB concepts we like. (Here’s their reprint of the Dynamo paper, with annotations. Not a PDF, even!) And then there are the easy ones:

  • Open source (Apache license)
  • Multiple backends (including a memory-only backend that we may even consider replacing memcached/Redis with)
  • Great community support
  • A plethora of libraries

But how about the big one? Indexes in RAM? Riak’s default backend – Bitcask – does require that the entire keyset fit in RAM. It’s the fastest of the backends, obviously, but it has no support for secondary indexes. So, for two reasons, Bitcask won’t fit our needs.

However, we can use Google’s LevelDB (Riak LevelDB documentation here). It has secondary-index support, and it does not require that all our keys and indexes fit in RAM. This is big for us, no, huge for us, because a lot of our data becomes historical as time goes on. It’s important that old data (least recently used, or “LRU”) can age out of RAM. It’s not that we mind keeping machines upgraded, but we can’t have the DB come to a screeching halt if we start paging to disk.
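
For a feel of what that buys us, here is a minimal sketch of secondary indexes (2i) through the official Riak Python client. The bucket, key, and index names are hypothetical, the exact client API may differ between versions, and 2i itself only works on the LevelDB (eleveldb) backend.

    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')
    saves = client.bucket('game_saves')

    # Write an object and tag it with secondary indexes.
    obj = saves.new('save:42', data={'player': 'alice', 'score': 1200})
    obj.add_index('player_bin', 'alice')        # binary (string) index
    obj.add_index('updated_int', 1375142400)    # integer index, e.g. last-update epoch
    obj.store()

    # Later, fetch keys by index instead of scanning the whole bucket.
    keys = saves.get_index('player_bin', 'alice')
    print(list(keys))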

Yes, Riak is an eventually consistent store, but so are most document or KV stores when run in a distributed environment.

Then there are these features, which add on to the niceties of Dynamo:

  • MapReduce (albeit slow as heck, it’s available for data analysis; there’s a sketch after this list)
  • Pre- and post-commit triggers for doing… stuff.
  • Link following
  • Counters
  • TTLs on data items
  • Buckets, which are kinda like namespaces (and not like tables)
  • The ability to use multiple backends at one time – which means you can tailor buckets by need.
  • We can easily “take our data with us” if we ever move to bare metal
  • SEARCH!
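
As promised above, here is a rough sketch of the kind of MapReduce job we have in mind for offline analysis, written against a hypothetical bucket and against the Python client as we understand it. The phases are JavaScript run inside Riak, and Riak.reduceSum is one of the built-in reduce functions.

    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')

    query = riak.RiakMapReduce(client)
    query.add('game_saves')  # maps over every object in the bucket -- slow on big buckets
    query.map("""function (v) {
        var save = JSON.parse(v.values[0].data);
        return [save.score];
    }""")
    query.reduce("Riak.reduceSum")

    print(query.run())  # e.g. [123456] -- summed score across all saves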

Secondary indexes aside, sometimes you just want to QUERY for something. Riak has Riak Search, but enough about that. Let’s talk about Yokozuna – their codename for direct Solr integration. Unlike CloudSearch bolted on top of DynamoDB, Solr is directly integrated into the core workings of Riak. You create an index in Solr, associate it with a bucket, create the schema you want (or don’t, the defaults will pick out values dynamically), and Riak keeps the index in sync automatically, indexing an item whenever it is written or changes in any way.

Herein lies a little bit of brilliance. First, as Riak is a distributed database, distributed Solr is used for the Solr integration. Second, by building a decent Solr schema, not only can you do away with many of your secondary-index needs, but in many instances you can get the results you need from Solr directly and never have to query Riak.
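
From the application side, the workflow looks roughly like the sketch below: create an index, point a bucket at it, and query with plain Solr syntax. This assumes search is enabled in Riak; the index name, bucket, and field names are hypothetical, and the client calls reflect the Python client as we understand it rather than anything authoritative.

    import riak

    client = riak.RiakClient(pb_port=8087, protocol='pbc')

    client.create_search_index('saves_idx')           # backed by distributed Solr
    saves = client.bucket('game_saves')
    saves.set_property('search_index', 'saves_idx')   # index everything written to this bucket

    saves.new('save:42', data={'player_s': 'alice', 'score_i': 1200}).store()

    # Query Solr directly; often the returned document is all we need, no Riak fetch required.
    results = client.fulltext_search('saves_idx', 'score_i:[1000 TO *]')
    for doc in results['docs']:
        print(doc)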

Wildbit does this a lot. While they started with a MongoDB/Elasticsearch setup, their eventual move to CouchDB was in part because it fit so nicely into their already-running Elasticsearch solution. With Riak, we’re getting something very similar that, initial setup aside, takes almost no management to keep up to date. (Just in case something does fall behind, Riak has a neat little feature called Active Anti-Entropy that will take care of it.)

Next, we have ease of cluster management. Even setting aside the cluster admin console, which makes things easier still, cluster management is about as simple as it gets. Adding a node to a cluster consists of a few command-line steps (roughly as sketched below). This means node management is insanely easy for ops-short teams like ours.
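
Roughly, joining a node looks like this (the node name is made up); changes are staged, reviewed with a plan, and then committed across the cluster:

    riak-admin cluster join riak@10.0.1.10
    riak-admin cluster plan
    riak-admin cluster commit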

Riak’s architecture also allows for rolling server upgrades, rolling software updates, node replacement, and so on. Add on a few monitoring scripts (we’ll be using Hosted Graphite) and we’re good to go. (Fun bonus – Hosted Graphite uses Riak on the backend.)

Conclusion

Nothing is ever as easy as it seems, and Riak will obviously present roadblocks we haven’t prepared for. But the official IRC channel (#riak on Freenode) is almost always staffed with Basho employees, they reply to email quite quickly, and most interesting to us is that they have “indie” pricing for their enterprise services, which we’ll likely avail ourselves of in the future.

Over time, we’ll let you know how our migration is going, and maybe this post will help you when you’re ready to move from MySQL to a distributed, horizontally scalable database solution.

  • Adam Kocoloski

    Hi Dave, thanks for this writeup. It’s great to see people carefully consider the tradeoffs of these systems, and Riak is a fine datastore.

    I wanted to clarify that CouchDB’s compaction absolutely does not discard user data. If a document has multiple active revisions all of those revisions will survive compaction. The compactor only removes the document bodies of revisions that have been superseded by newer edits.

    CouchDB also degrades gracefully when the keyspace cannot fit in RAM. One tip that’s useful for storage of time series data in databases that are backed by btrees is to choose IDs that incorporate the creation timestamp in some fashion. That way all of your recently (and frequently accessed) data gets colocated in the same portion of the index, which increases the system’s page cache hit rate and decreases the total amount of disk IO required to serve your requests. CouchDB will choose IDs like this for you automatically if you configure the UUID algorithm as “utc_random”.

    • Dave Martorana (http://davemartorana.com)

      Hi Adam. Thanks for the updated information on keys and performance degradation. I notice a lot of document/kv stores recommend using an intelligent key system.

      I was speaking from conversations I had with other companies that had tried CouchDB in production, and I assume CouchDB keeps getting better, addressing issues it has with performance – especially as the keyspace overtakes RAM. I edited the post to include a reference to your comment. Cheers!

  • coderoshi

    One correction: Yokozuna doesn’t update indexes by way of postcommit hooks. Rather we introduced a new mechanism to keep indexes up to date which fires whenever an object changes in any way, not merely when you trigger a write. This is great in cases when a node misses a write for some reason, like network congestion or a downed node. Riak Search did use postcommit, however.

    • Dave Martorana (http://davemartorana.com)

      Thanks! Updated.

  • Barry Morris (CEO, NuoDB Inc)

    Dave

    Good overview. Thanks for the perspective.

    Just for the record, PostgreSQL is a great piece of technology, and there are lots of situations in which it is a great solution. But it is certainly not a distributed database. It has a traditional RDBMS architecture that is predicated on a single server. Consequently running it in a distributed fashion requires the kind of cost, complexity and DBA effort to which you refer. The conventional way to do this involves sharding, replication and added caching layers, all three of which are band-aids to get around the limitations of the system. Again it’s not that PostgreSQL is a weak product. It was just never designed to run like this or support these kinds of applications.

    Google F1 is a distributed SQL/ACID database system (http://bit.ly/12R95NW, http://bit.ly/18Z7B5f). So is NuoDB (our product). Here is a good excerpt from the summary in the F1 white paper:

    “In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the AdWords product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries. Instead of going NoSQL, we built F1, a distributed relational database system that combines high availability, the throughput and scalability of NoSQL systems, and the functionality, usability and consistency of traditional relational databases, including ACID transactions and SQL queries.”

    This is the NuoDB view of the world. You don’t have to give up SQL or transactions. It doesn’t mean that SQL is the only answer, but certainly there is no need to abandon SQL for reasons of scale-out complexity/cost. And for applications in which you need transactional reliability or a powerful query language – that’s what NewSQL is for.

    Barry

    CEO, NuoDB Inc
