JsonTree Sublime Text 3 plugin

We use Sublime Text (2/3) pretty heavily at Flyclops. It’s my sole code editor, and some of the crew working on the Unity3D client app use it too.

We (and especially I) deal with a lot of JSON files. In my case, the majority of our backend is defined by AWS CloudFormation template files – and man, can they be long. So I wanted a way to browse them quickly.

So, I wrote this wonky, almost-working plugin for Sublime Text 3 (which is a “beta,” but substantially better than 2) called JsonTree. Hit a key combination (cmd-ctrl-j on OS X, or ctrl-alt-j on Windows/Linux) and it brings up a browsable, searchable tree view of your JSON document. Screenshot:

JsonTree screencap

JsonTree has been accepted into the Sublime Package Repository. Grab the Package Control plugin and install JsonTree from there.

This is the second open-source code bit from Flyclops. A lot more is to come!

Cheers,

Dave

GoCM: Super-simple asynchronous GCM listener in Go

We have been using pyapns, a very lightweight, fast, Python/Twisted-based asynchronous push notification server for the Apple Push Notification Service. It sent our very first push notification ever, and has sent well over 1.5 billion more since. I originally stumbled upon it while reading the Instagram Engineering Blog, and this post in particular.

With pyapns, we toss off an alert asynchronously and never block the API from returning as quickly as possible. It’s rock-solid.

GCM, or Google Cloud Messaging, is the technology used to send notifications to Android devices. Unlike APNS, GCM provides an HTTP REST interface for sending “downstream messages” (server-to-device).[1]

The problem is that sending a message to GCM over HTTP is a synchronous call. Our average API response time is ~70ms, but sending a message synchronously added 125–250ms on top of that.

So we needed to make GCM just as “fast” as APNS.

It’s likely not obvious yet, but we’re slowly beginning to move pieces of our monolithic Python-based API server application to a multi-service architecture built primarily on Go. At the speed we’re answering API requests, we’ve hit the need for a massively concurrent language that’s also capable of taking advantage of multi-core, multi-CPU systems. But more on this in a later post.

No matter what language our API is built on, now or in the future, pyapns will likely continue to run our APNS push notifications. So, we needed something similar for GCM.

Enter GoCM.

GoCM follows some of the basic conventions of pyapns, but almost none of the architecture. It accepts a push request and returns immediately, handing the work off to a series of asynchronous goroutines. It can handle thousands of messages at a time, and doesn’t block the API.

GoCM listens over HTTP on port 5601 (or one of your choosing) and can accept messages to send from any application written in any language – from a Bash script to Python to C++ and everything in between.
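
For example, here’s roughly what a send looks like from Python. The endpoint path and payload keys below are illustrative guesses, not GoCM’s documented API – check the GitHub repo for the real interface:

# Hypothetical client sketch -- the /gcm/send path and the payload keys
# are assumptions for illustration only.
import json
import urllib2

payload = {
    'registration_ids': ['DEVICE_REGISTRATION_ID'],
    'data': {'message': 'Your move!'},
}

req = urllib2.Request(
    'http://localhost:5601/gcm/send',  # GoCM on its default port
    json.dumps(payload),
    {'Content-Type': 'application/json'},
)

# GoCM answers immediately; delivery to Google happens in background
# goroutines, so this call never blocks our API workers.
print urllib2.urlopen(req).getcode()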

It also has a (currently minor) reporting interface, and it gracefully handles GCM canonical responses, processing them asynchronously as well.

Open Source

While we rely on a lot of open source software, and contribute code to many OSS projects, GoCM is the first piece of OSS originally authored at Flyclops that is being released for public use. If you’d like to give GoCM a try, head over to our GitHub repository and check it out.

GoCM is enabling us to grow with our Android traffic while still keeping those nice, fast response times. Maybe you can get some use out of it too.

[1] GCM is also beginning to push its long-lived-connection solution for messaging, which uses CCS/XMPP. However, there are several hurdles – first, it requires special device permissions, and with hundreds of thousands of copies of our games in the wild, it would take months to get them all updated to use the newer technology – and until then, we’d have to continue sending messages over the HTTP REST interface. Secondly, we currently have no need for GCM’s bidirectional XMPP protocol.

Amazing, big, portable (and cheap!) whiteboards

Every business needs whiteboards – large whiteboards. Some go for fancy frosted-glass whiteboards that are absolutely amazing, but require complicated mounting, very careful moving, and a bunch of money ($500-$700 and up).

Even large wood whiteboards are incredibly expensive. We wanted whiteboard space of at least 3′ x 5′, and we wanted several of them. Prices were in the $250-700 range, and the boards were flimsy, requiring wall mounting to have any real sturdiness or stability.

So we got creative, and have some sweet whiteboards to show for it.

We decided we wanted them to be portable (we work at Philly Game Forge and our desk situation is constantly evolving), large, and most importantly, resilient and self-supporting. I knew you could buy whiteboard paint, and started there. Friends of ours painted whole walls in IdeaPaint, awesome stuff that can cost north of $150 per GALLON.

Then we found this great company – Magic Murals – that sells custom-sized whiteboard roll-on material that is thick and beautiful. The price was right, so all we needed was something to make into a whiteboard.

We headed to Home Depot and bought ourselves a couple of doors. 32″ x 80″ was a nice size, and two of them side-by-side would be brilliant. These doors do not come with pre-drilled door knob holes, so they’re perfect. Light (16 lbs), strong and tall, they can simply be leaned against a wall – no mounting needed.

Next we special-ordered the white-board material from Magic Murals. We bought it oversized, of course – 7′ x 3′ (or 84″ x 36″) so when we applied the material we would have overhang we could trim later. Each cut cost us $85. Each door cost $26. So far we’re in for $111.

A quick sanding of the door surface (and some wiping clean with a damp cloth), and the door was ready for application. We laid out the material, and peeled off the back, using eraser-blocks to smooth the material on to the doors.


 

After it was all applied, we let the glue cure to the door, and then using a utility knife, made quick work of trimming the material to the door size.

  

That was it! Now we have two large (7.5′ x 2.5′) whiteboards that cost $111 each, and that we can use in a variety of ways. Want a huge surface? Put them next to each other. Multiple meetings? Split them up and move them to different locations. You can flip the boards to make use of the whole vertical space. You could even mount a rail on the wall, and lift the whiteboards into place in a conference room. It’s the flexibility of use that makes them amazing.


Most importantly, if we really, really love them as we put them through their paces, we’ll order more material and apply it to the back side of the doors, making the whiteboards dual-sided as well.

We couldn’t be happier. We’ll post an update in a few weeks to let you know how they’re standing up.

Using Packer for faster scaling

We have been playing with Packer for a little while, at the behest of a friend of Flyclops. While I haven’t completely explained how we manage our server stack(s), I’ve hinted before at the following configuration for any of our ephemeral servers that automatically scale up and down with load:

  • AWS CloudFormation templates
  • Vanilla Ubuntu servers
  • Just-in-time provisioning of specific servers at scale/deploy-time using CloudInit

This worked well for a good amount of time. With the help of Fabric, a single command line compiles all of our CloudFormation templates, provisioning scripts, and variables for scale alarms and triggers, updates our stack parameters and notifications, and pushes new code to running servers. All of our scripts, config files for servers, etc., are Jinja2 templates that get compiled before being uploaded to Amazon.
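
As a tiny sketch of that compile step (the helper and file names below are illustrative, not our actual code):

# Illustrative only: render a Jinja2 template -- a CloudFormation
# template or a provisioning script -- with the stack's context
# before it gets uploaded to Amazon.
from jinja2 import Environment, FileSystemLoader

def render_template(path, context):
    env = Environment(loader=FileSystemLoader('.'))
    return env.get_template(path).render(**context)

userdata = render_template('scripts/cloudinit.sh.j2',
                           {'environment': 'production'})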

The biggest bottleneck was our scale-up. The process worked like this:

  1. Boot a new EC2 instance at desired size, using a vanilla Ubuntu AMI
  2. Attach server to the load balancer with a long initial wait period
  3. Run a series of in-order shell scripts (via CloudInit) to build the server into whatever it needs to be
  4. Load balancer detects a fully-functional server
  5. Traffic begins to be routed to the new server

The beauty of our system was also its biggest fault. Our servers never got stale, and as they scaled up and down, benefitted from bug fixes to the core OS. Every server that started handling work was freshly built, and any error in the build process would kill the instance and boot a new server – so the fault-tolerance was fantastic, too.

But the downsides were many as well. The provisioning process was highly optimized and still took over 6 minutes to get a server from boot to answering traffic. Provisioning on the fly required access to several external services (APT repositories, PyPI, Beanstalk/GitHub, and so on). Any service disruption to any of these would cause a failed build (and we were unable to scale until the external service issue had been resolved).

Caching as a bad first attempt

We went through several rounds of attempting to remove external dependencies from scaling – from committing code to the pip2pi project (to support the on-the-fly creation of an S3 bucket serving as a PyPI mirror) to bundling git snapshots, and so on.

Eventually, the staleness we were trying to avoid became possible in all the other services we were now caching on the AWS network, and we were maintaining a lot more code.

Enter Packer

Packer brought us almost back to our roots. By ripping out massive amounts of support code and adding only a bit to integrate Packer, we were able to recreate on-the-fly builds of local VMs that were almost identical to those running in our stack. From there, it was a very easy process to pack the VM into an AMI, and use that AMI, rather than a vanilla Ubuntu AMI, when scaling. Here’s the entirety of the pack command in our code (note, this is a Fabric command, and uses Fabric function calls):

@task
@runs_once
def pack(builder=None):
    '''
    Pack for the current
    environment and name
    '''
    if not builder:
        abort('Please provide a builder name for packing')

    _confirm_stack()
    _get_stack_context()

    env.template_context['builder'] = builder

    # This function compiles all of our provisioning
    # shell scripts
    _compile_userdata()

    # This compiles our packer config file
    packer_dir = _get_packer_folder()
    packer_file = join(packer_dir, 'packer.json')
    compiled_packer_file = join(packer_dir, 'packer_compiled.json')
    print 'Using packer file: %s' % packer_file

    # Get the compiled packer file
    packer_compiled_contents = _get_jina2_compiled_string(packer_file)
    with open(compiled_packer_file, 'w') as f:
        f.write(packer_compiled_contents)

    # Move in to the directory and pack it
    local("cd %s && packer build -only=%s packer_compiled.json && cd .." % (
        packer_dir,
        builder
    ))

    # Clean up
    local("rm %s/*" % join(packer_dir, 'scripts'))
    local("rm %s" % compiled_packer_file)
    local("rm -rf %s" % join(packer_dir, 'packer_cache'))

Of course, we need to grab the AMI that packer just created. You can use tagging and all that fun, but here’s a quick way to do it with boto:

def _get_ami_for_stack():
    ec2_conn = boto.connect_ec2(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    images = ec2_conn.get_all_images(
        owners=['self'],
        filters={
            'name': '*%s*' % env.environment  # This might be 'production' or 'staging'
        }
    )

    images.sort(key=lambda i: i.name)
    images.reverse()

    ami_image = images[0]
    return ami_image.id

Now we place the Packer-created AMI ID into our CloudFormation template (in the AutoScaleGroup section) and we’re off to the races.
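
Here’s a hedged sketch of that hand-off with boto – the “AmiId” parameter name, the region, and the stack/template arguments are hypothetical, not our actual template:

# Sketch only: feed the freshly packed AMI ID to CloudFormation as a
# stack parameter, where the AutoScaleGroup's LaunchConfiguration
# picks it up as its ImageId.
import boto.cloudformation

def _update_stack_ami(stack_name, template_body):
    ami_id = _get_ami_for_stack()
    cfn = boto.cloudformation.connect_to_region('us-east-1')
    cfn.update_stack(
        stack_name,
        template_body=template_body,
        parameters=[('AmiId', ami_id)],  # hypothetical parameter name
    )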

The benefits are plentiful:

  • AMI is built at pack time, not scale time
  • Can still use CloudInit to build, just happens at pack-time, not scale time
  • No reliance on any external services to bring another server online
  • Updated AMI is a command-line statement away
  • Can pack local VMs using identical code
  • Removal of all caching code

But most importantly…

Because no code has to run for our services to scale up, we’ve gone from 6-minute scale-up times to approximately the time it takes for the EC2 instance to boot and the ELB to hit it once. That is currently < 60 seconds.

Packer has been a fantastic win for speeding up and simplifying deployments for us. I recommend you look into it for your own stack.

Video games don’t have to be crazy technical

We’ve been working for months porting our most successful game to Unity – it lets us move from iOS only to iOS and Android, plus opens up platforms like Facebook, Windows Phone, etc. It’s a huge technical undertaking.

To add to that, the game doesn’t run without a large server farm, most recently requiring us to rethink our database strategy to manage “large” data as we grow to millions of players.

But does a video game really require huge technical chops?

Na.

About a week before the Philly Game Forge game showcase (Philly Game Forge being the new game co-working space in Philadelphia started by the ever-amazing Cipher Prime), Jake had an idea for a social game. It was a bit of genius in that it wasn’t a technical challenge so much as a psychological one, and it was the perfect party game – and this party had hundreds of people.

Still, it required implementation, technically speaking. In his infinite wisdom, Jake asked me if I could build the game – on the web – in 6 hours. After berating him about the head for a while, I agreed to the challenge.

The basics are this – Django, Blueprint, a little bit of knowledge about meta tags for mobile development, and the ever stunningly amazing Webfaction – and we had a game. In 6 hours. In fact, in this video you can see me finishing the game as the guests are appearing.

And so was born “Colors” by Flyclops.

The best part is this – of the whole crop of amazing games shown at the Philly Game Forge showcase, Colors was one of the most popular of the night. Why? It was engaging and social, encouraged spying and espionage and even a bit of cheating… with so little actual code.

So while our flagship product continues to be a conglomeration of technologies spanning from horizontally scalable databases on UNIX to Unity development for iOS and Android to front-end servers handling 500+ requests per second and more… an amazing game doesn’t require 6 months of development.

And while we could make “Colors” (to be released for parties everywhere in the near future) into a hugely technical endeavor, it’s the beauty of game design that allowed us to build it in 6 hours.

So yeah, this is a tech blog, but yeah – happiness in gaming doesn’t require technological wizardry – sometimes it just requires an insane amount of love… and 6 or so hours of development.

Shell scripting: logging fractional elapsed time

I wanted a more robust way to track how long it takes to build certain parts of our servers. As I describe in another post (and want to dramatically expand on in a future post), we use a series of logically separated shell scripts to build our servers from vanilla instances. So here’s a bit of a tip about logging elapsed time.

Here’s a basic shell script:

#!/bin/sh
START=`date +%s`
 
# Do stuff here
 
END=`date +%s`
ELAPSED=$(( $END - $START ))

Seems smart enough – and then I can take the ELAPSED variable and log it or whatever. But this method only works in whole seconds, and that’s silly. So this is a bit more elaborate, but much more exact:

#!/bin/sh
START=`date +%s%N`
 
# Do stuff here
 
END=`date +%s%N`
ELAPSED=`echo "scale=8; ($END - $START) / 1000000000" | bc`

This is the difference between “2 seconds” and “2.255322 seconds”, which is important to us, since some shell scripts may only take a fraction of a second.

So what’s going on?

First, we change the date format to

date +%s%N

Adding the %N gets us nanoseconds instead of seconds – as whole numbers, that is. (Note: %N doesn’t work on BSD variants.) So we divide by 1 billion and pipe the calculation string through bc, the Unix command-line calculator, which returns our difference in seconds to 8 decimal places.

This changes my output from:

Backup complete in 2 seconds

to:

Backup complete in 2.83732828 seconds

Much better :)

Replacing “pip bundle”

2014-02-06: This solution was a band-aid house of cards, and we eventually ended up moving to Packer, which removes the need to bundle Python dependencies at all.

Our API application is written in Python, using a lightweight framework called Flask, sitting behind nginx proxying to uWSGI. As part of our code-commit process, we bundled all of our Python library dependencies and committed them to our code repo – this removes the dependency on PyPI being up and available (which it often isn’t).

But bundle is going away.

Pip’s bundle command was not very popular, and that (along with a lack of ongoing development) led the pip maintainers to deprecate the code. When I last bundled our dependencies, I received the following message in the terminal:

###############################################
##                                           ##
##  Due to lack of interest and maintenance, ##
##  'pip bundle' and support for installing  ##
##  from *.pybundle files is now deprecated, ##
##  and will be removed in pip v1.5.         ##
##                                           ##
###############################################

Well that sucks.

So, not really knowing the alternatives and not finding much with LMGTFY, I turned to Stack Overflow with this question: Is there a decent alternative to “pip bundle”?

TL;DR – check out pip wheel.

Basically, wheel creates (or installs) binary packages of dependencies. What we want to do is create a cache of our dependencies and store them in our source repo.

NOTE: Because packages that require compiling (think MySQL-python, etc.) are pre-compiled, wheel will create platform-specific builds. If you are developing on OS X and using x86_64 Linux in production, you’ll have to cache your production binaries from Linux, not OS X.

So… Here are the steps.

Continuing to rely on requirements.txt

We need to modify our file just a little bit. If you have a pure dependency list, you’re good to go. If you have any pointers to code repositories (think git) you need a minor change. Here’s the before and after:

riak==2.0.0
riak-pb==1.4.1.1
pytest==2.3.5

-e git+git@your.gitrepo.com:/repo0.git#egg=repo0
-e git+git@your.gitrepo.com:/repo1.git#egg=repo1

Wheel doesn’t respect the -e flag, and has trouble with SSH-based git links, so go ahead and put in the https equivalents. There is also no need to name the “egg,” as wheel is basically a replacement for eggs.

riak==2.0.0
riak-pb==1.4.1.1
pytest==2.3.5

git+https://your.gitrepo.com/repo0.git
git+https://your.gitrepo.com/repo1.git

Cache the dependencies as wheel packages

Now you can call pip to download and cache all of your dependencies. I like to put them in local/wheel (they are .whl, or “wheel” bundles, basically glorified zip files). You will require the “wheel” package for this part, but not for installing the already bundled packages.

Due to the pre-compiled nature of wheel, package names end with the platform they were compiled for. Pure-Python packages, which can be installed anywhere, end in -none-any.whl. For instance, the boto package for Amazon AWS:

boto-2.3.0-py27-none-any.whl

However, packages like MySQL-python that require binary compilation will result in platform-specific file names. Note the difference for OS X and Linux (in our case, Ubuntu 13.04):

MySQL_python-1.2.4-cp27-none-linux_x86_64.whl
MySQL_python-1.2.4-cp27-none-macosx_10_8_x86_64.whl

To cache the wheel packages, run the following line:

$ pip install wheel && pip wheel --wheel-dir=local/wheel -r requirements.txt

This isn’t nearly as convenient as, say, using pip bundle to create a single requirements.pybundle file, but it works just fine.

Add to git, or whatever you use

Commit the local/wheel directory to your repo, so the bundles are available for you to install at production-time.

Installing on production servers

This is where we ran into problems. Despite being cached, any git-based packages still go out to git when you run the following command on the production server:

$ pip install --use-wheel --no-index --find-links=local/wheel -r requirements.txt

This breaks our desire to not have to rely on any external service for installing requirements. What’s worse is that the package in question is in fact in the ./local/wheel directory. So a little bit of command-line magic, and installing the packages by name works just as well:

$ ls local/wheel/*.whl | while read x; do pip install --use-wheel --no-index --find-links=local/wheel $x; done

This basically lists the local/wheel directory and passes the results into pip install --use-wheel, which also has the --find-links argument telling pip to look for any dependencies in the local/wheel folder as well. --no-index keeps pip from looking at PyPI.

NOTE: If you have multiple binary packages for different platforms, you’ll have to modify the command above to ignore binary packages that are not built for the specific platform you’re installing to.
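
One rough way to do that – a sketch that assumes the local/wheel layout above – is to filter on each wheel’s platform tag before handing it to pip:

# Install only the wheels built for this machine, plus the pure-Python
# "-none-any" wheels. A sketch, not battle-tested.
import glob
import subprocess
from distutils.util import get_platform

# e.g. 'linux-x86_64' -> 'linux_x86_64', matching the wheel filename tag
platform_tag = get_platform().replace('-', '_').replace('.', '_')

for wheel in glob.glob('local/wheel/*.whl'):
    if wheel.endswith('-none-any.whl') or platform_tag in wheel:
        subprocess.check_call([
            'pip', 'install', '--use-wheel', '--no-index',
            '--find-links=local/wheel', wheel,
        ])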

Final word

Those are the basics. This can be automated in all sorts of ways – even zipping up all the wheel files into a single file to get you pretty close to a .pybundle file. It’s up to you – but hopefully this will help as you are torn away from the arms of pip bundle.

2013-09-23: Edited to better represent the binary nature of pre-compiled packages and their platform specificity.

Unite 2013 (Unity Conference) Post-mortem

About a year ago, we decided to port our iOS game, Domino!, to multiple platforms. Obviously Android was up first, and we had to choose between a native port – using Java and, say, Cocos2d-x (or some other 2D OpenGL framework) to port our iOS Cocos2D code to – or a complete ground-up rewrite of the entire application in Unity.

We chose Unity.

It’s not hard to see why. Yes, Unity meant a ground-up rewrite, but a native Android port would have been a ground-up rewrite too – with none of the benefits: no code sharing, having to maintain staff with wildly disparate areas of expertise, and so on.

With Unity, we require fewer people, can share and grow our knowledge together, and can bring multiple platforms into release parity (iOS and Android first, but maybe Facebook, Windows Phone, even BlackBerry later) – all, again, with a single codebase.

So we hopped a plane to Vancouver, grabbed a cherry Airbnb penthouse apartment (cheaper than two hotel rooms), and came to Unite 2013. There have been great talks, we were given a look at their extended roadmap (to which we are bound to secrecy), and we saw the glorious 4.3 beta with native 2D tools – the lack of which has caused us headaches along the way, and the arrival of which will make our development roadmap much cleaner and our codebase smaller.

Pixel orca statue by the Vancouver Convention Center.

Unity is screaming forward in a hugely positive direction, and our decision to move our entire front-end development roadmap to it was validated this week. It’s not that we didn’t already know we were making the right decision, but it’s good to know the platform you’re buying into has the same ideas, goals, and desires you want it to have, and more.

If you’re thinking about game development, evaluate all your options, then grab Unity and get building. :)

Building Riak on OmniOS

When we first set out about using Riak, we chose to develop against the Yokozuna release – an alpha-stage integration of Solr into the core of Riak (but not “riak-core,” as it were), replacing the never-quite-amazing Riak Search. Although Yokozuna is still in alpha, we didn’t have time to architect our server infrastructure and API once against Riak Search and then re-architect it again against Solr once Riak 2.0 is released.

We followed the basic instructions for getting it installed on Ubuntu Linux 12.10 or 13.04 (both of which we run in production for our API servers), and it worked fine.

But the more we researched about the best way to back up Riak, the more we figured out that taking a node down, copying its data (or, in our case, we could snapshot an EBS volume in AWS), spinning the node back up, and then waiting for it to catch up again (and then doing that in a rolling fashion for N nodes) was just silly.

So after chatting again with the awesome team at Wildbit about some of their solutions, we decided to give OmniOS a try. It’s an operating system built by OmniTI on top of illumos – a fork of the now-defunct OpenSolaris project – by many of the people who worked on OpenSolaris in its day.

Needless to say, UNIX is the mainframe of our day. It’s incredibly rock-solid, and every release is basically a Long Term Support release (LTS) because, short of serious security updates, an installation will run solidly – well – forever.

But we didn’t just hop from Linux to UNIX because I’m nostalgic for the Sun Sparcstations and Xterms in the basement of the CoRE building at Rutgers University.

Well, maybe a little bit.

But mostly, it was for ZFS.

ZFS allows us to take live-snapshots of any file system (including intelligent diffs instead of full snapshots) and ship them off to a backup server. It means zero downtime for a node with full backups. Wildbit has some of its servers snapshot once per second – that’s how granular you can get with your backups. Restoration is incredibly easy, and did I mention there is zero node downtime?

I’ll write more about our final ZFS snapshot setup when we’ve finalized the architecture. But first, we had to get Riak building on OmniOS – both for development and production.

Before I just spit code onto the page, you should know the biggest issue we had to tackle was that everything was being built 32-bit. OmniOS runs with dual instruction set support (x86 and x86-64), and in many scenarios defaults to 32-bit. Once we figured out that’s why we were being denied disk access (we were getting “Too many files open” errors even with our open file limit set to 65536) and running into other issues, getting a final build script became much easier.

Let’s get started

NOTE: Many, many thanks go to the people on the #riak and #omnios IRC channels on Freenode, as well as the Riak Users and OmniOS Discuss mailing lists, for all of their help.

We’ll start by installing the following packages:

NOTE: Thanks to Eric Sproul from OmniTI for pointing out that they maintain root certificates in a package; I’ve moved the CA certificate install into this list.

# Install compilers, gmake, etc.
pkg install pkg:/developer/versioning/git
pkg install pkg:/developer/build/gnu-make
pkg install pkg:/developer/gcc46
pkg install pkg:/system/header
pkg install pkg:/system/library/math/header-math
pkg install pkg:/developer/object-file
pkg install pkg:/network/netcat
pkg install pkg:/web/ca-bundle

These packages replace things like build-essential et al. in Linux. We’re using GCC 4.6 instead of 4.7 (4.7.2 on OmniOS currently) because of a compilation bug that is fixed in 4.7.3 and 4.8 (but there’s no package for either of those versions yet).

Next, download and install Java JDK from Oracle. Make sure to use Oracle’s JDK, not OpenJDK, and you must, must, must download the 64 bit versions. We downloaded the JDK i586 and x64 .gz files and put them in a private S3 bucket for ease of access.

# Download files from S3 in to folder such as /opt/<YOUR NAME>/java
gzip -dc jdk-7u25-solaris-i586.gz | tar xf -
gzip -dc jdk-7u25-solaris-x64.gz | tar xf -

We have all these fun tools, so let’s set up some PATH’ing:

NEWPATH=/opt/gcc-4.6.3/bin/:/usr/gnu/bin:/opt/flyclops/java/jdk1.7.0_25/bin/amd64:/opt/flyclops/java/jdk1.7.0_25/bin:/opt/omni/bin:$PATH 
export PATH=$NEWPATH
echo "export PATH=$NEWPATH" >> /root/.profile

Now we’ll create an unprivileged user to run Riak under, and the directory for holding Riak. We create a directory at /riak, which I’m sure we’ll get yelled at for in the comments:

# Make riak user
useradd -m -d /export/home/riak riak
usermod -s /bin/bash riak
echo "export PATH=$NEWPATH" >> /export/home/riak/.profile

# Create the riak dir
mkdir /riak
chown riak:other /riak

Our final build tool is kerl – we use it for easily building and installing Erlang.

# Install kerl
curl -O --silent https://raw.github.com/spawngrid/kerl/master/kerl; chmod a+x kerl
mv kerl /usr/bin

Last but not least, we want to make sure our open file limit is nice and high. There are two ways to do this in OmniOS; this shows both. The first is user-specific and more correct, the other is system-wide:

# Add project settings for riak
sudo projadd -c "riak" -K "process.max-file-descriptor=(basic,65536,deny)" user.riak

# Add limits to /etc/system
bash -c "echo 'set rlim_fd_max=65536' >> /etc/system"
bash -c "echo 'set rlim_fd_cur=65536' >> /etc/system"

# Set this session's limit high
ulimit -n 65536

Building Erlang and Riak as the `riak` user

We want to run the rest of the commands as the unprivileged riak user. If you’re going to do this by hand, you can simply call sudo su - riak. If you’re doing this via a shell script, create a second script for building Erlang and Riak, and launch it from the first script, which runs as root. Example:

# Run stage 2 builds as riak user
sudo -u riak sh /path/to/riak_build_stage2.sh

Now that we’re the riak user, we can build Erlang. As of this writing, R15B01 is required for Riak 1.4.1, and R15B02 is required for the latest Yokozuna build. Choose appropriately. Please note the --enable-m64-build flag – this is very important. It (along with other commands further down) will ensure a 64-bit build, not a 32-bit build.

# Build Erlang
# MAKE SURE WE USE --enable-m64-build
echo 'KERL_CONFIGURE_OPTIONS="--disable-hipe --enable-smp-support --enable-threads --enable-kernel-poll --enable-m64-build"' > ~/.kerlrc
kerl build R15B02 r15b02

# Install and activate Erlang
mkdir -p /riak/erlang/r15b02
kerl install r15b02 /riak/erlang/r15b02
. /riak/erlang/r15b02/activate

Finally, download (however you choose) a Riak source package and build it. Here we’ll download the Yokozuna 0.8.0 release and build it. Again, note that we’re setting CFLAGS to use -m64 to ensure a 64 bit build.

# Get riak
cd /riak
wget -nv http://data.riakcs.net:8080/yokozuna/riak-yokozuna-0.8.0-src.tar.gz
tar zxvf riak-yokozuna-0.8.0-src.tar.gz
mv riak-yokozuna-0.8.0-src riak

# Build riak
cd /riak/riak
CC=gcc CXX=g++ CFLAGS="-m64" make

# "Install" riak, using "make rel" or "make stagedevrel"
CC=gcc CXX=g++ CFLAGS="-m64" make rel

There you have it. Riak (and if you like, Yokozuna) on OmniOS (currently, R151006p). We’ve done all the build experimenting, so hopefully this will save you some time!

Taking the Riak plunge

As with most companies that approach the 100-million-row count in MySQL, we’re starting to hit growing pains. We’re also a bit abnormal in our DB needs – our load is about 45% writes. This is rather unfortunate for us – read replicas are way less useful for our particular needs, especially as you still have to have a single write master. We realize there are master-master setups, but the complications are too much to deal with.

So we’ve been adding RAM as our indexes grow (currently around 23.5G and growing), but vertical scaling forever is obviously a non-starter. We’re also planning for a huge bump in data when we release to Android, so we started looking around at other solutions.

This conversation doesn’t go far before we start talking about NoSQL, but we don’t care about the “SQL-iness” of a solution. What we care about are:

  • Horizontal scalability
  • Speed
  • Keysets larger than RAM not killing the DB
  • Ease of scaling (add a node, get performance)
  • Ease of management (small ops team)
  • Runs well on AWS (EC2)
  • Great community and support
  • Load-balanced

We spoke with our good friends at Wildbit (who, it so happens, power all of our transactional emails in Postmark and host all of our code in Beanstalk) about their scaling problems. They already went through this exercise (and continue to do so as they host an unbelievable amount of data) and were able to point us around a bit.

While PostgreSQL was on our radar for a while, the management of highly-scalable relational databases sent our heads spinning. Right now, our ops team is very, very small. So, we started looking at alternative solutions. Here’s what we came up with:

CouchDB and MongoDB

You can’t have an intelligent conversation if you aren’t at least slightly versed in CouchDB and MongoDB. Things they have in common are:

  • Document store with “Collections” (think tablespace)
  • Relatively easy to code for
  • Great libraries
  • Great community
  • Keyset must be in RAM
  • Swapping can bring down the DB instantly
  • Adding capacity requires understanding sharding, config servers, query routers, etc.

This is hardly a comprehensive list, but there were a couple non-starters for us.

CouchDB’s data files require compaction, so that’s no good (an ops problem),  and compaction has been known to fail silently, which is even worse for non-experts.

We talked with several companies we’re friendly with, and MongoDB became less and less favorable just in the ops story. While each company has its own story to tell, none were particularly good.

But the biggest roadblock for us is that the keyset must remain in RAM. After hearing horror stories of what happens when keysets swap to disk (even SSDs, although it’s a much better story with SSDs), we moved on. We can’t afford EC2 instances with dedicated SSD drives, and we need our DB to keep serving when we reach capacity or are adding a new node to the cluster.

(I should mention Wildbit uses CouchDB for some things, and are quite happy with it, but they have a much more significant ops team. Read about their decision to move off of MongoDB and on to CouchDB with the help of Cloudant.)

UPDATE: Originally, this write-up stated that compaction could cause data loss. Adam Kocoloski from Cloudant took a bit of issue with this in the comments below. To his credit, while you can lose earlier versions of documents on compaction, as long as you’re doing manual conflict resolution, it shouldn’t be an issue. He also notes that CouchDB has gotten better at degrading gracefully when the keyspace overruns RAM capacity.

Cassandra and Hadoop/HBase

Both of these databases crossed our plates as well because they meet a lot of the things we want… They scale to incredible sizes, deal with huge datasets easily, etc. But the more research we did, the more we understood that Cassandra would require a hefty amount of ops. Adding a node to the cluster is not a simple task, and neither is getting a cluster set up properly. While it has absolutely brilliant parts, it also has fear-inducing parts.

Hadoop came close, but also had heavy requirements when it came to understanding the underlying architecture, and is really built to deal with datasets in the terabyte-to-petabyte range – and that’s not us.

Others

We also looked at:

  • PostgreSQL – it has a nice new key-value column type called hstore, and the idea of getting the best of both worlds – relational and KV in the same solution – was really tempting. But the ops requirements are not trivial, and it runs into many of the same scaling issues that are making us hop off of MySQL
  • Neo4j – not entirely proven, and the graph features are not something we’d use heavily
  • And about a bajillion others.

DynamoDB

There is a seemingly obvious path with DynamoDB here. We’re pretty tightly coupled to AWS’s infrastructure (because we like to be), and DynamoDB would seem to make sense – in fact, it was in the lead most of the time. However, there are downsides.

Werner Vogels’ original post and the Dynamo paper behind DynamoDB stirred things up a lot. Dynamo is integral to running Amazon.com, and many of its ideas make incredible amounts of sense. Huge pros came out:

  • “Unlimited” scale
  • Highly distributed
  • Managed – hides almost all ops, reducing load on our team
  • Highly available
  • Cost efficient
  • SSD-backed

These are some tasty treats. But the pricing is difficult to understand, and more difficult to plan for. Each “write” operation only accepts 1k of data, and our game save state is often over that – meaning each game update can require 4 “write” units or more. In order to not have DynamoDB spit errors at you, you have to “provision” a certain amount of reads/writes per second, so we’ll always be paying for the worst-case scenario, and have the potential to get screwed if we ever get, say, featured by Apple on the App Store. Even worse – scaling up your capacity can take hours, during which the old capacity limits remain in place.
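
As a back-of-the-envelope sketch of that math (the payload size and request rate below are illustrative, not our real traffic):

# Rough write-capacity math for DynamoDB, circa 2013: one write unit
# covers one write per second of an item up to 1 KB.
import math

item_size_kb = 3.4         # a typical game save, often well over 1 KB
peak_writes_per_sec = 400  # the worst case we would have to provision for

units_per_write = int(math.ceil(item_size_kb / 1.0))
provisioned_write_units = units_per_write * peak_writes_per_sec

print provisioned_write_units  # 1600 units, paid for around the clock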

Further limitations include:

  • “Secondary indexes” cost more write units per save, meaning even small writes that have secondary indexes can quickly clog up your provisioned capacity
  • No MapReduce
  • No searching – you can set up CloudSearch, but this is an additional cost, and it’s up to you to manually submit updates from DynamoDB to CloudSearch
  • No triggers
  • No links between data items
  • Proprietary
  • REALLY difficult to “take your data with you”
  • No counters (we do need to be able to replace “select count(*) from…” queries)
  • No TTL for entries

Finally, Riak (by Basho)

Despite all the positives of DynamoDB, its closed-source, proprietary nature, along with severely slow scaling and a lack of decent querying capabilities (without involving even more AWS offerings), made us take a solid look at Riak.

Riak ticked a lot of boxes. First, its philosophy is very closely tied to many of the Dynamo concepts we like. (Here’s their reprint of the Dynamo paper, with annotations. Not a PDF, even!) And then there are the easy ones:

  • Open source (Apache license)
  • Multiple backends (including a memory-only backend that we may even consider replacing memcached/Redis with)
  • Great community support
  • A plethora of libraries

But how about the big one? Indexes in RAM? Riak’s default backend – Bitcask – does require that the entire keyset fit in RAM. It’s the fastest of the backends, obviously, but doesn’t have any support for secondary indexes. So for two reasons, this won’t fit our needs.

However, we can use Google’s LevelDB (Riak LevelDB documentation here). It has secondary index support, and does not require that all our indexes fit in RAM. This is big for us, no, huge for us, because a lot of data becomes historical as time goes on. It’s important that indexes that are old (Least Recently Used, or “LRU”) can pass out of RAM. It’s not that we don’t want to have to keep machines upgraded, but we can’t have the DB come to a screeching halt if we page.
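
As a taste of what that looks like from Python – a minimal sketch where the bucket, key, and index names are made up, and the exact calls may differ between client versions:

# Store a game document, tag it with a secondary index, and query it
# back. Illustrative names only, not our production schema.
import riak

client = riak.RiakClient()
games = client.bucket('games')

game = games.new('game-1234', data={'player': 'dave', 'state': 'active'})
game.add_index('player_bin', 'dave')
game.store()

# With the LevelDB backend, this 2i query keeps working even once the
# full keyset no longer fits in RAM.
for key in games.get_index('player_bin', 'dave'):
    print key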

Yes, Riak is an eventually consistent store, but so are most document or KV stores in a distributed environment.

Then there are these features, which add on to the niceties of Dynamo:

  • MapReduce (albeit slow as heck, it’s available for data analysis)
  • Pre- and post-commit triggers for doing… stuff.
  • Link following
  • Counters
  • TTLs on data items
  • Buckets, which are kinda like namespaces (and not like tables)
  • The ability to use multiple backends at one time – which means you can tailor buckets by need.
  • We can easily “take our data with us” if we ever move to bare metal
  • SEARCH!

Secondary indexes aside, sometimes you want to QUERY for something. Riak has Riak Search, but enough about that. Let’s talk about Yokozuna – their codename for direct Solr integration. Unlike CloudSearch on top of DynamoDB, Solr is integrated directly into the core workings of Riak. You create an index in Solr, associate it with a bucket, create the schema you want (or don’t – the defaults will pick out values dynamically), and it adds a post-commit “magic” trigger to the bucket that will instantly index an item when it is written or updated.
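
Here’s a hedged sketch of what querying that looks like from the Python client – the index name, bucket association, and field names are hypothetical, since we haven’t finalized our own schema yet:

# Assumes an index named 'games_idx' has been created and associated
# with the 'games' bucket, so the post-commit hook indexes every write.
import riak

client = riak.RiakClient()

results = client.fulltext_search('games_idx', 'player:dave AND state:active')

print results['num_found']
for doc in results['docs']:
    print doc['_yz_rk']  # the Riak key of each matching document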

Herein lies a bit of brilliance. First, as Riak is a distributed database, Distributed Solr is used for the Solr integration. Secondly, by building a decent Solr schema, not only can you do away with many of your secondary-index needs, but in many instances you can get the results you need from Solr directly, and never have to query Riak.

Wildbit does this a lot. While they started with a MongoDB/Elasticsearch setup, their eventual move to CouchDB was in part because it fit so nicely into their already-running Elasticsearch solution. With Riak, we’re getting something very similar that, minus the initial setup, takes almost no management to keep up to date. (Just in case something does get behind, Riak has a neat little feature called Active Anti-Entropy which will take care of it.)

Next we have ease of cluster management. Even without the cluster admin console (which makes things easier still), cluster management is about as simple as it gets. Adding a node to a cluster consists of about two command-line steps. This means node management is insanely easy for ops-short teams like ours.

Riak’s architecture also allows for rolling server upgrades, rolling software updates, node replacement, and so on. Add on a few monitoring scripts (we’ll be using Hosted Graphite) and we’re good to go. (Fun bonus – Hosted Graphite uses Riak on the backend.)

Conclusion

Nothing is ever as easy as it seems, and Riak will obviously present roadblocks we haven’t prepared for. But the official IRC channel (#riak on Freenode) is almost always staffed with Basho employees, they reply to email quite quickly, and most interesting to us is that they have “indie” pricing for their enterprise services, which we’ll likely avail ourselves of in the future.

Over time, we’ll let you know how our migration is going, and maybe this post will help you when you’re ready to move from MySQL to a distributed, horizontally scalable database solution.