Technology at Flyclops

Building Riak on OmniOS

posted in DevOps on 2013-08-19 00:00:00 UTC by

When we first set out to use Riak, we chose to develop against the Yokozuna release - an alpha-stage integration of Solr into the core of Riak (but not "riak-core", as it were), replacing the never-quite-amazing Riak Search. Although Yokozuna is still in alpha, we didn't have time to architect our server infrastructure and API against Riak Search, and then re-architect them against Solr once Riak 2.0 is released.

We followed the basic instructions for getting it installed on Ubuntu Linux 12.10 and 13.04 (both of which we run in production for our API servers), and it worked fine.

But the more we researched the best way to back up Riak, the more we realized that taking a node down, copying its data (or, in our case, snapshotting an EBS volume in AWS), spinning the node back up, waiting for it to catch up again, and then doing that in a rolling fashion for N nodes was just silly.

So after chatting again with the awesome team at Wildbit about some of their solutions, we decided to give OmniOS a try. It's an operating system built by OmniTI on top of illumos, a fork of the relatively defunct OpenSolaris project maintained by many of the people who worked on OpenSolaris in its day.

Needless to say, UNIX is the mainframe of our day. It’s incredibly rock-solid, and every release is basically a Long Term Support release (LTS) because, short of serious security updates, an installation will run solidly - well - forever.

But we didn’t just hop from Linux to UNIX because I’m nostalgic for the Sun Sparcstations and Xterms in the basement of the CoRE building at Rutgers University.

[Image: Xterms - lovingly stolen from http://eceweb1.rutgers.edu/~bushnell/]

Well, maybe a little bit.

But mostly, it was for ZFS.

ZFS allows us to take live snapshots of any file system (including incremental diffs instead of full snapshots) and ship them off to a backup server. It means zero downtime for a node with full backups. Wildbit has some of its servers snapshot once per second - that's how granular you can get with your backups. Restoration is incredibly easy, and did I mention there is zero node downtime?
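
For a rough feel of the mechanics, here's a minimal sketch of a snapshot-and-ship cycle. The pool, dataset, and host names are made up, and our actual setup wraps this in scripts:

# Take a point-in-time snapshot of a (hypothetical) dataset holding Riak's data
zfs snapshot data/riak@2013-08-19-0100

# First run: send the full snapshot to a backup host
zfs send data/riak@2013-08-19-0100 | ssh backup-host zfs receive backup/riak

# Subsequent runs: send only the incremental diff between two snapshots
zfs snapshot data/riak@2013-08-19-0200
zfs send -i data/riak@2013-08-19-0100 data/riak@2013-08-19-0200 | ssh backup-host zfs receive backup/riak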

I’ll write more about our final ZFS snapshot setup when we’ve finalized the architecture. But first, we had to get Riak building on OmniOS - both for development and production.

Before I just spit code onto the page, you should know the biggest issue we had to tackle was that everything was being built 32-bit. OmniOS runs with dual instruction set support (x86 and x86-64) and, in many scenarios, defaults to 32-bit. Once we figured out that this was why we were being denied disk access (we were getting "Too many open files" errors even with our open file limit set to 65536) and running into other issues, getting a final build script became much easier.
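
If you want to verify what you've ended up with at any point, stock illumos tools make it easy to tell 32-bit from 64-bit (the beam.smp path below assumes the Erlang install location used later in this post):

# Show which instruction sets the kernel and userland support
isainfo -v

# Check whether a particular binary is 32- or 64-bit
file /riak/erlang/r15b02/erts-5.9.2/bin/beam.smp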

Let’s get started

NOTE: Many, many thanks go to the people on the #riak and #omnios IRC channels on Freenode, as well as the Riak Users and OmniOS Discuss mailing lists, for all of their help.

We’ll start by installing the following packages:

NOTE: Thanks to Eric Sproul from OmniTI for pointing out that they maintain root certificates in a package; I've moved the CA certificate install into this list.

# Install compilers, gmake, etc.
pkg install pkg:/developer/versioning/git
pkg install pkg:/developer/build/gnu-make
pkg install pkg:/developer/gcc46
pkg install pkg:/system/header
pkg install pkg:/system/library/math/header-math
pkg install pkg:/developer/object-file
pkg install pkg:/network/netcat
pkg install pkg:/web/ca-bundle

These packages replace things like build-essential et al. on Linux. We're using GCC 4.6 instead of 4.7 (currently 4.7.2 on OmniOS) because of a compilation bug that is fixed in 4.7.3 and 4.8 (but there's no package for either of those versions yet).

Next, download and install the Java JDK from Oracle. Make sure to use Oracle's JDK, not OpenJDK, and you must, must, must download the 64-bit version. We downloaded both the JDK i586 and x64 .gz files and put them in a private S3 bucket for ease of access. (On Solaris, the x64 package is an add-on to the i586 package rather than a standalone install, which is why both get extracted below.)

# Download files from S3 into a folder such as /opt/<YOUR NAME>/java
gzip -dc jdk-7u25-solaris-i586.gz | tar xf -
gzip -dc jdk-7u25-solaris-x64.gz | tar xf -

We have all these fun tools, so let’s set up some PATH’ing:

NEWPATH=/opt/gcc-4.6.3/bin/:/usr/gnu/bin:/opt/flyclops/java/jdk1.7.0_25/bin/amd64:/opt/flyclops/java/jdk1.7.0_25/bin:/opt/omni/bin:$PATH
  
export PATH=$NEWPATH
echo "export PATH=$NEWPATH" >> /root/.profile
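
It's worth a quick sanity check that the right toolchain now wins on the PATH (paths assume the /opt/flyclops layout above):

# gcc should report 4.6.x; java should resolve to the amd64 JDK binary
gcc --version
which java
java -version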

Now we’ll create an unprivileged user to run Riak under, and the directory for holding Riak. We create a directory at /riak, which I’m sure we’ll get yelled at for in the comments:

# Make riak user
useradd -m -d /export/home/riak riak  
usermod -s /bin/bash riak
echo "export PATH=$NEWPATH" >> /export/home/riak/.profile

# Create the riak dir
mkdir /riak
chown riak:other /riak

Our final build tool is kerl - we use it for easily building and installing Erlang.

# Install kerl
curl -O --silent https://raw.github.com/spawngrid/kerl/master/kerl; chmod a+x kerl
mv kerl /usr/bin

Last but not least, we want to make sure our open file limit is nice and high. There are two ways to do this in OmniOS; the snippet below shows both. The first is user-specific and more correct, the other is system-wide:

# Add project settings for riak  
sudo projadd -c "riak" -K "process.max-file-descriptor=(basic,65536,deny)" user.riak

# Add limits to /etc/system
bash -c "echo ‘set rlim_fd_max=65536' >> /etc/system"
bash -c "echo ‘set rlim_fd_cur=65536' >> /etc/system"

# Set this session's limit high  
ulimit -n 65536
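
To double-check that the new limit actually applies, prctl can report the resource control for the riak project, and a shell started as riak should see the higher limit (a quick sketch; see prctl(1) for details):

# Report the file descriptor rctl on the user.riak project
prctl -n process.max-file-descriptor -i project user.riak

# A login shell as riak should report 65536
su - riak -c 'ulimit -n'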

Building Erlang and Riak as the `riak` user

We want to run the rest of the commands as the unprivileged riak user. If you're doing this by hand, you can simply call sudo su - riak. If you're doing this via a shell script, create a second script for building Erlang and Riak, and launch it from your first script, which runs as root. Example:

# Run stage 2 builds as riak user
sudo -u riak sh /path/to/riak_build_stage2.sh

Now that we're riak, we can build Erlang. As of this writing, R15B01 is required for Riak 1.4.1, and R15B02 is required for the latest Yokozuna build. Choose appropriately. Please note the --enable-m64-build flag - this is very important. It (along with other flags further down) will ensure a 64-bit build, not a 32-bit build.

# Build Erlang
# MAKE SURE WE USE --enable-m64-build
echo 'KERL_CONFIGURE_OPTIONS="--disable-hipe --enable-smp-support --enable-threads --enable-kernel-poll --enable-m64-build"' > ~/.kerlrc
kerl build R15B02 r15b02

# Install and activate Erlang  
mkdir -p /riak/erlang/r15b02
kerl install r15b02 /riak/erlang/r15b02

. /riak/erlang/r15b02/activate

Finally, download (however you choose) a Riak source package and build it. Here we'll use the Yokozuna 0.8.0 release. Again, note that we're setting CFLAGS to -m64 to ensure a 64-bit build.

# Get riak  
cd /riak
wget -nv http://data.riakcs.net:8080/yokozuna/riak-yokozuna-0.8.0-src.tar.gz
tar zxvf riak-yokozuna-0.8.0-src.tar.gz
mv riak-yokozuna-0.8.0-src riak

# Build riak  
cd /riak/riak
CC=gcc CXX=g++ CFLAGS="-m64" make

# "Install" riak, using "make rel" or "make stagedevrel"  
CC=gcc CXX=g++ CFLAGS="-m64" make rel

There you have it. Riak (and if you like, Yokozuna) on OmniOS (currently, R151006p). We’ve done all the build experimenting, so hopefully this will save you some time!

Taking the Riak plunge

posted in Databases on 2013-07-31 00:00:00 UTC by Dave Martorana

As with most companies that approach the 100-million-row mark in MySQL, we're starting to hit growing pains. We're also a bit abnormal in our DB needs - our load is about 45% writes. This is rather unfortunate for us - read replicas are far less useful for our particular needs, especially as you still have to have a single write master. We realize there are master-master setups, but the complications are too much to deal with.

So we’ve been adding RAM as our indexes grow (currently around 23.5G and growing), but vertical scaling forever is obviously a non-starter. We’re also planning for a huge bump in data when we release to Android, so we started looking around at other solutions.

This conversation doesn’t go far before we start talking about NoSQL, but we don’t care about the “SQL-iness” of a solution. What we care about are:

  • Horizontal scalability
  • Speed
  • Keysets larger than RAM not killing the DB
  • Ease of scaling (add a node, get performance)
  • Ease of management (small ops team)
  • Runs well on AWS (EC2)
  • Great community and support
  • Load-balanced

We spoke with our good friends at Wildbit (who, it so happens, power all of our transactional emails in Postmark and host all of our code in Beanstalk) about their scaling problems. They already went through this exercise (and continue to do so as they host an unbelievable amount of data) and were able to point us around a bit.

While PostgreSQL was on our radar for a while, the management of highly-scalable relational databases sent our heads spinning. Right now, our ops team is very, very small. So, we started looking at alternative solutions. Here’s what we came up with:

CouchDB and MongoDB

You can't have an intelligent conversation about this space if you aren't at least slightly versed in CouchDB and MongoDB. Things they have in common are:

  • Document store with “Collections” (think tablespace)
  • Relatively easy to code for
  • Great libraries
  • Great community
  • Keyset must be in RAM
  • Swapping can bring down the DB instantly
  • Adding capacity requires understanding sharding, config servers, query routers, etc.

This is hardly a comprehensive list, but there were a couple non-starters for us.

CouchDB’s data files require compaction, so that’s no good (an ops problem), and compaction has been known to fail silently, which is even worse for non-experts.

We talked with several companies we’re friendly with, and MongoDB became less and less favorable just in the ops story. While each company has its own story to tell, none were particularly good.

But the biggest roadblock for us is that the keyset must remain in RAM. After hearing horror stories of what happens when keysets swap to disk (even SSDs, although it's a much better story with SSDs), we moved on. We can't afford EC2 instances with dedicated SSD drives, and we need our DB to keep serving when we reach capacity or are adding a new node to the cluster.

(I should mention Wildbit uses CouchDB for some things, and is quite happy with it, but they have a much more significant ops team. Read about their decision to move off of MongoDB and onto CouchDB with the help of Cloudant.)

UPDATE: Originally, this write-up stated that compaction could cause data loss. Adam Kocoloski from Cloudant took a bit of issue with this in the comments below. To his credit, while you can lose earlier versions of documents on compaction, as long as you're doing manual conflict resolution it shouldn't be an issue. He also notes that CouchDB has gotten better at degrading gracefully when the keyset overruns RAM capacity.

Cassandra and Hadoop/HBase

Both of these databases crossed our radar as well because they meet a lot of our requirements: they scale to incredible sizes, deal with huge datasets easily, etc. But the more research we did, the more we understood that Cassandra would require a hefty amount of ops. Adding a node to the cluster is not a simple task, and neither is getting a cluster set up properly. While it has absolutely brilliant parts, it also has fear-inducing parts.

Hadoop came close, but also had heavy requirements when it came to understanding the underlying architecture, and is really built to deal with datasets in the terabyte-to-petabyte range - and that's not us.

Others

We also looked at:

  • PostgreSQL - it has a nice new key-value column type called hstore, and the idea of getting the best of both worlds - relational and KV in the same solution - was really tempting. But the ops requirements are not trivial, and it runs into many of the same scaling issues that are making us hop off of MySQL
  • Neo4j - not entirely proven, and the graph features are not something we’d use heavily
  • And about a bajillion others.

DynamoDB

There is a seemingly obvious path with DynamoDB here. We're pretty tightly coupled to AWS's infrastructure (because we like to be), and DynamoDB would seem to make sense - in fact, it was in the lead for most of this process. However, there are downsides.

Werner Vogels's original post about DynamoDB and the Dynamo paper stirred things up a lot. Dynamo is integral to running Amazon.com, and many of its ideas make an incredible amount of sense. Huge pros came out:

  • “Unlimited” scale
  • Highly distributed
  • Managed - hides almost all ops, reducing load on our team
  • Highly available
  • Cost efficient
  • SSD-backed

These are some tasty treats. But the pricing is difficult to understand, and more difficult to plan for. Each "write" unit only accepts 1 KB of data, and our game save state is often over that - meaning each game update can require 4 "write" units or more. In order to not have DynamoDB spit errors at you, you have to "provision" a certain number of reads/writes per second, so we'll always be paying for the worst-case scenario, and have the potential to get screwed if we ever get, say, featured by Apple in the App Store. Even worse - scaling up your capacity can take hours, during which the old capacity limits stay in place.

Further limitations include:

  • “Secondary indexes” cost more write units per save, meaning even small writes that have secondary indexes can quickly clog up your provisioned capacity
  • No MapReduce
  • No searching - you can set up CloudSearch, but this is additional cost, and it's up to you to manually submit updates from DynamoDB to CloudSearch
  • No triggers
  • No links between data items
  • Proprietary
  • REALLY difficult to “take your data with you”
  • No counters (we do need to be able to replace “select count(*) from…” queries)
  • No TTL for entries

Finally, Riak (by Basho)

Despite all the positives of DynamoDB, its closed-source, proprietary nature, along with painfully slow capacity scaling and a lack of decent querying capabilities (without involving even more AWS offerings), made us take a solid look at Riak.

Riak ticked a lot of boxes. First, its philosophy is very closely tied to many of the DynamoDB concepts we like. (Here's their reprint of the Dynamo paper, with annotations. Not a PDF, even!) And then there are the easy ones:

  • Open source (Apache license)
  • Multiple backends (including a memory-only backend that we may even consider replacing memcached/Redis with)
  • Great community support
  • A plethora of libraries

But how about the big one? Indexes in RAM? Riak’s default backend - Bitcask - does require that the entire keyset fit in RAM. It’s the fastest of the backends, obviously, but doesn’t have any support for secondary indexes. So for two reasons, this won’t fit our needs.

However, we can use Google's LevelDB (Riak LevelDB documentation here). It has secondary index support and does not require that all our keys and indexes fit in RAM. This is big for us - no, huge for us - because a lot of data becomes historical as time goes on. It's important that old, least-recently-used ("LRU") indexes can pass out of RAM. It's not that we don't want to keep machines upgraded, but we can't have the DB come to a screeching halt if we page.
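
For reference, switching a node from Bitcask to eLevelDB is a one-line change to the storage_backend setting in Riak 1.4's etc/app.config. The sed one-liner below is only a sketch - it assumes GNU sed and a stock rel/riak layout, and the same edit is trivial to make by hand:

# Swap the riak_kv storage backend from Bitcask to eLevelDB
sed -i 's/riak_kv_bitcask_backend/riak_kv_eleveldb_backend/' rel/riak/etc/app.config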

Yes, Riak is an eventually consistent store, but so are most document or KV stores in a distributed environment.

Then there are these features, which add on to the niceties of Dynamo:

  • MapReduce (albeit slow as heck, it's available for data analysis)
  • Pre- and post-commit triggers for doing… stuff.
  • Link following
  • Counters
  • TTLs on data items
  • Buckets, which are kinda like namespaces (and not like tables)
  • The ability to use multiple backends at one time - which means you can tailor buckets by need.
  • We can easily “take our data with us” if we ever move to bare metal
  • SEARCH!

Secondary indexes aside, sometimes you just want to QUERY for something. Riak has Riak Search, but enough about that. Let's talk about Yokozuna - Basho's codename for direct Solr integration. Unlike CloudSearch on top of DynamoDB, Solr is directly integrated into the core workings of Riak. You create an index in Solr, associate it with a bucket, create the schema you want (or don't - the defaults will pick out values dynamically), and Yokozuna adds a post-commit "magic" trigger to the bucket that will instantly index an item when it is written or updated.
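
To give a feel for it, here's roughly what that flow looks like over Riak's HTTP interface. These endpoints follow the Yokozuna documentation for the upcoming Riak 2.0 interface, so the 0.8.0 alpha may differ slightly - treat it as a sketch, and the index, bucket, and field names are made up:

# Create a Solr index
curl -XPUT http://localhost:8098/search/index/games

# Associate the index with a bucket
curl -XPUT http://localhost:8098/buckets/games/props \
  -H 'Content-Type: application/json' \
  -d '{"props":{"search_index":"games"}}'

# Writes to the bucket are now indexed automatically; query with Solr syntax
curl 'http://localhost:8098/search/query/games?wt=json&q=player_s:dave'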

Herein lies a little bit of brilliance. First, as Riak is a distributed database, distributed Solr is used for the Solr integration. Secondly, by building a decent Solr schema, not only can you do away with many of your secondary-index needs, but in many instances you can get the results you need from Solr directly and never have to query Riak.

Wildbit does this a lot. While they started with a MongoDB/Elasticsearch setup, their eventual move to CouchDB was in part because it fit so nicely into their already-running Elasticsearch solution. With Riak, we're getting something very similar that, minus initial setup, takes almost no management to keep up to date. (Just in case something does get behind, Riak has a neat little feature called Active Anti-Entropy which will take care of it.)

Next we have ease of cluster management. Even setting aside the cluster admin console (which makes things easier still), cluster management is about as simple as it gets. Adding a node to a cluster consists of a couple of command-line steps, as sketched below. This means node management is insanely easy for ops-short teams like us.
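
For the curious, joining a new node to an existing cluster looks roughly like this with Riak 1.4's tooling (node names here are hypothetical):

# From the new node, ask to join an existing member of the cluster
riak-admin cluster join riak@10.0.1.10

# Review the planned ring changes, then commit them
riak-admin cluster plan
riak-admin cluster commit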

Riak’s architecture also allows for rolling server upgrades, rolling software updates, node replacement, and so on. Add on a few monitoring scripts (we’ll be using Hosted Graphite) and we’re good to go. (Fun bonus - Hosted Graphite uses Riak on the backend.)

Conclusion

Nothing is ever as easy as it seems, and Riak will obviously present roadblocks we haven’t prepared for. But the official IRC channel (#riak on Freenode) is almost always staffed with Basho employees, they reply to email quite quickly, and most interesting to us is that they have “indie” pricing for their enterprise services, which we’ll likely avail ourselves of in the future.

Over time, we’ll let you know how our migration is going, and maybe this post will help you when you’re ready to move from MySQL to a distributed, horizontally scalable database solution.

Tagged:
#riak  #databases

Ubuntu 13.04 Raring base box for VMware Fusion Provider

posted in DevOps on 2013-07-25 00:00:00 UTC by

As nice as the VMware Fusion provider for Vagrant might be, the lack of working boxes is absurd for the cost of the provider plugin license. In any case, we got a 13.04 Raring box set up.

We started with the Precise 12.04 LTS box that they do provide and ran a few

sudo do-release-upgrade

calls to get to 12.10 and then 13.04. This of course breaks the VMware Tools installation, and thus Vagrant, as it attempts to mount shared folders.

Following some instructions here (warning, Google Cache link because the Ubuntu Forums are down for now) for patching the VMware Tools on VMware Fusion 5.0.3 for 13.04, we re-installed the VMware Tools and got everything working.

The base box is available here:

http://com_flyclops_bin.s3.amazonaws.com/ubuntu-13.04-vmware.box

Cloud-init in Vagrant with Ubuntu 12.10 and 13.04

posted in DevOps on 2013-06-21 00:00:00 UTC by Dave Martorana

Flyclops's entire stack runs on AWS and uses CloudFormation and Auto Scaling groups to launch servers on the fly, with no intervention on our part. We don't use custom AMIs, but rather vanilla Ubuntu server images (12.10 at the time of writing) and Ubuntu's very cool cloud-init functionality to bootstrap new servers with shell scripts. I want to use the identical method to bootstrap my Vagrant boxes for local development. Here's how to do it.

(Note: This is a cross-post from here.)

I'll go further into how we bootstrap - templated shell files that are compiled into multipart MIME messages, gzipped, and put on S3, with a single bootstrap-the-bootstrap script being passed to the server as user data - in another post. But I wanted my local servers to run almost identical boot scripts via Vagrant so my dev environment was almost identical to my production environment.

Ubuntu provides cloud images that are used in EC2, as well as OpenStack, Rackspace, etc., and even has versions for Vagrant. The Vagrant versions have their own cloud-init boot scripts built in to set up networking and whatnot… but I want them to then run my scripts. Obviously.

Vagrant has this really nice thing where you can supply a shell script to run the first time your boxes start. Simply put something similar to this in your Vagrantfile to have it run your script:

# Run our shell script on provisioning
config.vm.provision :shell, :path => "vagrant_build.sh"

Next we want to provide a very small shell script that clears out cloud-init's run info and sets it up to run again. Here's the script; we'll step through it below.

PLEASE NOTE: This method of working with cloud-init does not work on Ubuntu 12.04 or earlier - cloud-init was significantly updated between 12.04 and 12.10.

# Check to see if we have done this already  
if [ -f /.vagrant_build_done ]; then
  echo "Found, not running."
  exit
fi

# Make the box think it hasn't init-ed yet
rm -rf /var/lib/cloud/instance/*  
rm -rf /var/lib/cloud/seed/nocloud-net/user-data

# Seed our own init scripts  
cat << 'END_OF_FILE_CONTENTS' > /var/lib/cloud/seed/nocloud-net/user-data  
Content-Type: multipart/mixed; boundary="===============apiserversStackMultipartMessage=="
MIME-Version: 1.0

--===============apiserversStackMultipartMessage==

#include
https://someS3bucket.s3.amazonaws.com/somefolder/vagrant/someDateStampedFile.gz
  
--===============apiserversStackMultipartMessage==--

END_OF_FILE_CONTENTS

# Re-run cloud-init  
cloud-init init
cloud-init modules --mode init
cloud-init modules --mode config
cloud-init modules --mode final

# Do not let this run again  
touch /.vagrant_build_done

First things first - we don't want this to re-run cloud-init every time the box boots, so we check for a record that it has already run:

# Check to see if we have done this already 
if [ -f /.vagrant_build_done ]; then
  echo "Found, not running."    
  exit
fi

Assuming we don't find anything, we're off to the races. The two rm lines clean up the existing cloud-init trail, almost re-virgining the machine. That said, we're going to re-use the meta-data file from the vanilla image's nocloud-net seed, and simply supply our own user-data. We also clear out the "instance" directory so our new scripts can be written there by cloud-init.

# Make the box think it hasn't init-ed yet 
rm -rf /var/lib/cloud/instance/*
rm -rf /var/lib/cloud/seed/nocloud-net/user-data

The cat heredoc does the heavy lifting of moving a script into place for us. Again, we're putting this in

/var/lib/cloud/seed/nocloud-net/user-data

and using cat to write the contents of the file for us.

Note: This method of using bash scripts to write files is ugly, but works reliably. I use this method a lot - but I have a custom Jinja2 filter that writes out the ugly for me, keeping my template files nice and clean.

# Seed our own init script  
cat << 'END_OF_FILE_CONTENTS' > /var/lib/cloud/seed/nocloud-net/user-data
Content-Type: multipart/mixed; boundary="===============apiserversStackMultipartMessage=="
MIME-Version: 1.0

--===============apiserversStackMultipartMessage==

#include
https://someS3bucket.s3.amazonaws.com/somefolder/vagrant/someDateStampedFile.gz

--===============apiserversStackMultipartMessage==--
END_OF_FILE_CONTENTS

Notice that in the user-data we're actually writing out

#include

which tells cloud-init to “include” scripts at that url. According to the docs,

> the content read from the URL can be gzipped, mime-multi-part, or plain text

so I store a gzipped multipart mime file. Works like a charm, and for EC2, gets us around the user-data size limit for supplying scripts to cloud-init.
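
If you're curious how that file gets built, the write-mime-multipart helper from Ubuntu's cloud-utils package does the assembling. Something along these lines (the script names are made up, and our real version is templated and then uploaded to S3):

# Combine individual scripts into one multipart MIME message, then gzip it
write-mime-multipart --output=combined-userdata.txt \
  install_packages.sh:text/x-shellscript \
  configure_api.sh:text/x-shellscript
gzip combined-userdata.txt   # upload combined-userdata.txt.gz to S3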

Now comes the fun - with cloud-init cleaned out and our new user-data in, we simply force-run cloud-init again.

# Re-run cloud-init  
cloud-init init
cloud-init modules --mode init
cloud-init modules --mode config
cloud-init modules --mode final

First we initialize it, then run “modules” through initialization, configuration, and finalization. The line

cloud-init modules --mode final

is the most important for us. Through initialization and configuration, cloud-init has downloaded our gzip file from S3, unzipped it, and separated out all of the parts of our multipart MIME message into component shell scripts, but has not yet run those scripts. The final modules call asks cloud-init to run them now.

The last line is just there so that if Vagrant re-executes this build script at any point, we don't run cloud-init through the initialization process again.

touch /.vagrant_build_done

So that's it. Despite the length of this post, it's a very short script. I should note that you can also add any other initialization needed specifically for Vagrant boxes after this - like symlinking /vagrant (the shared directory) into a particular location, etc. - to finalize the box for local development.
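
As a tiny, purely hypothetical example, the tail end of the same provisioning script might do something like:

# Vagrant-only touch-up: link the shared folder to wherever the app expects to live
ln -sfn /vagrant /opt/flyclops/app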