Modern Von Neumann machines, how do they work?
Modern Microprocessors - A 90 Minute Guide! If you didn't find a peculiar joy in computer architecture classes or the canonical tomes on the topic by Patterson and Hennessy, this is the thing for you. It's a great dive into how modern processors work, what the design challenges and trade-offs are, and what you need to know as a software developer.
Totally unrelated: when I interned at Texas Instruments, my last project was writing tests for a pre-silicon DSP. Because there were no test devices, I had to run my code against a simulator. It simulated several million gates of logic and reported the result of my program as the values on the wires coming out of the processor's registers. This was fun, again in a way peculiar to my interest at the time in becoming a hardware designer/driver hacker. Let me tell you, every debugging tool you will ever see is better than inspecting hex values coming out of registers.
Anyway, these programs ran super slow; each run took about an hour. One day I did the math and figured out the simulator was basically running at 100 Hz. Not kilohertz or megahertz. One hundred hertz. So, yeah. In the snow, uphill, both ways.
Changing legacy code, made less painful
Rescuing Legacy Code by Extracting Pure Functions. Come across strange, pre-existing code. Decide you need to change it. Follow the pattern described herein. Apply TDD afterwards. I so wish someone had shown me this technique years and years ago. Also, Composed Method (from Smalltalk Best Practice Patterns) is so great, I can't even put it into words.
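To give a flavor of the pattern (a contrived sketch of my own, not the article's example): pull the calculation out of the tangle of state and I/O, and it becomes trivially testable.

# Before: the interesting math is buried in a method that also does I/O
def print_invoice(order)
  total = order.items.inject(0) { |sum, item| sum + item.price * item.quantity }
  total *= 0.9 if order.items.size > 10
  puts "Total: #{total}"
end

# After: the calculation is a pure function of its inputs; TDD away
def invoice_total(items)
  total = items.inject(0) { |sum, item| sum + item.price * item.quantity }
  items.size > 10 ? total * 0.9 : total
end

def print_invoice(order)
  puts "Total: #{invoice_total(order.items)}"
end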
Cassandra at Gowalla
Over the past year, I’ve done a lot of work making Cassandra part of Gowalla’s multi-prong database strategy. I recently spoke at Austin on Rails on this topic, doing a sort of retrospective on our adoption of Cassandra and what I learned in the process. You can check out the slide deck, or if you’re a database nerd like me, dig into the really nerdy details below.
Why does Gowalla use Cassandra?
We have a few motivations for using Cassandra at Gowalla. First off, it’s become our database of choice for applications with relatively fixed query patterns that, for us to succeed, need to handle a rapidly growing dataset. Cassandra’s read and write paths are optimized for these kinds of applications. It’s good at keeping the hot subset of a database in memory, while keeping queries that do have to hit disk pretty quick too.
Cassandra is also great for time-oriented applications. Any time we need to fetch data based primarily on some sort of timestamp, Cassandra is a great fit. It’s fairly unique in this regard, and that’s one of the main reasons I’m so interested in it.
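To sketch what that looks like with the cassandra gem (keyspace, column family, and names are all hypothetical here): events are written as columns named by a time-based UUID, so reading a timeline is one ordered slice of a row.

require 'cassandra/0.7'
require 'simple_uuid'

connection = Cassandra.new('AppKeyspace', '127.0.0.1:9160')

# Column names are TimeUUIDs, so Cassandra keeps them sorted by time
connection.insert(:Timelines, 'user:42',
                  { SimpleUUID::UUID.new => 'checkin:1001' })

# The twenty most recent events come straight off the row, newest first
recent = connection.get(:Timelines, 'user:42', :count => 20, :reversed => true)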
Cassandra is a Dynamo-style database, which yields some nice operational properties. If a node goes down overnight, we don’t take an availability hit; the ops people can sleep through the night and fix it later. The Cassandra developers have also done a great job of eliminating the cases where one needs to restart an entire Cassandra cluster at once, which would mean downtime.
When does Gowalla not use Cassandra?
I don’t think Cassandra is all that great for iterating on prototypes. When you’re not sure what your data or queries will end up looking like, it’s hard to build a schema that works well with Cassandra. You’re also unlikely to need the strengths that a distributed, column-oriented database offers at that stage. Plus, there aren’t any options for outsourced Cassandra right now, and early-stage applications/businesses rarely want to devote expertise to hosting a database.
Applications that don’t grow data quickly, or that can fit their entire dataset in memory on a pair of machines, don’t play to Cassandra’s strengths either. Given that you can get a machine with a few dozen gigabytes of memory for the cost of rent in the valley, sometimes it pays off to scale vertically instead of horizontally as Cassandra encourages.
Cassandra applications at Gowalla
We have a handful of applications going that use Cassandra:
- Audit: Stores ActiveRecord change data to Cassandra. This was our training-wheels trial project where we experimented with Cassandra to see if it was useful for us. It was incrementally deployed using rollout and degrade. Worked well, so we proceeded.
- Chronologic: This is an activity feed service, storing the events and timelines in Cassandra. It started off life as a secondary index cache, but became a system of record in our latest release. It works great operationally, but the query/access model didn’t always jibe with how web developers expected to access data.
- Active stories: We store “joinability” data for users at a spot so we can pre-merge stories and prevent proliferation of a bunch of boring, one-person stories. This was built by Brad Fults and integrated in one pull request a few weeks before launch. The nice thing about this one was that it could take advantage of Cassandra’s column expiration (sketched after this list) and fit really nicely into Cassandra’s data model.
- Social graph caches: We store friend data from other systems so we can quickly list/suggest friends when they connect their Gowalla profile to Facebook or Twitter. This started life on Redis, but the data was growing too quickly. We decoupled it from Redis and wrote a Cassandra backend over a few days. We incrementally deployed it and got Redis out of the picture within two weeks. That was pretty cool.
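Since column expiration came up, it's worth a quick sketch, because expiring data is painful in most stores and nearly free in Cassandra. Assuming the cassandra gem's :ttl option, and with invented names, it looks roughly like:

require 'cassandra/0.7'

connection = Cassandra.new('AppKeyspace', '127.0.0.1:9160')

# The column evaporates on its own after four hours; no cleanup job needed
connection.insert(:ActiveStories, 'spot:77', { 'user:42' => 'story:9' },
                  :ttl => 4 * 60 * 60)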
What worked?
- Stable at launch. A couple weeks before launch, I switched into “devops” mode. Along with Adam McManus, our ops guy, we focused on tuning Cassandra for better read performance and resolving stability problems. We ended up bringing in a DataStax consultant to help verify we were doing the right things with Cassandra. The result was that, at launch, our cluster held up well and we didn’t have any Cassandra-related problems.
- Easy to tune. I found Cassandra interesting and easy to tune. There is a little bit of upfront research in figuring out exactly what the knobs mean and what the reporting tools are saying. Once I figured that out, it was easy to iteratively tweak things and see if they were having a positive effect on the performance of our cluster.
- Time-series or semi-granular data. Of the databases I’ve tinkered with, Cassandra stands out in terms of modeling time-related data. If an application is going to pull data in time-order most of the time, Cassandra is a really great place to start. I also like the column-oriented data model. It’s great if you mostly need a key-value store, but occasionally need a key-key-value store.
What would we do differently next time?
- Developer localhost setups. We started using Cassandra in the 0.6 release, when it was a giant pain to set up locally (XML configs). It’s better now, but I should have put more energy into helping the other developers on our team get Cassandra up and running properly. If I were to do it again, I’d probably look into leaning on the install scripts the cassandra gem includes, rather than Homebrew and a myriad of scripts to hack the Cassandra config.
- Eventual consistency and magic database voodoo. Cassandra does not work like MySQL or Redis. It has different design constraints and a relatively unique approach to those constraints. In advocating and explaining Cassandra, I think I pitched it too much as a database nerd and not enough as “here’s a great tool that can help us solve some problems”. I hope that CQL makes it easier to put Cassandra in front of non-database nerds, in terms they can easily relate to and be immediately productive with.
- Rigid query model. Once we got several million rows of data into Cassandra, we found it difficult to quickly change how we represented that data. It became a game of “how can we incrementally rejigger this data structure to have these other properties we just figured out we want?” I’m not sure that’s a game you can easily win at with Cassandra. I’d love to read more about building evolvable data structures in Cassandra and see how people are dealing with high-volume, evolving data.
Things we’ll try differently next time
- More like a hash, less like a database. Having developed a database-like thing, I have come to the conclusion that developers really don’t like databases very much. ActiveRecord was hugely successful because it was so much more effective than anything before it that tried to make the database just go away. The closer a database is to one of the native data structures in the host language, the better. If it’s not a native data structure, it should be something they can create in a REPL and then say “magically save this for me!”
- Better tools and automation. That said, every abstraction leaks. Once it does, developers want simple and useful tools that let them figure out what’s going on, see what the data really looks like, tinker with it, and get back to their abstracted world as quickly as possible. This starts with tools for setting up the database, continues through interacting with it (a database REPL), and extends to operating it (logging, introspection, etc.). Cassandra does pretty well on these tools, but they’re still a bit nerdy.
- More indexes. We didn’t design our applications to use secondary indexes (a great feature) because they didn’t exist just yet. I should have spent more time integrating this into the design of our services. We got bit a lot towards the end of our release cycle because we were building all of our indexes in the application and hadn’t designed for reverse indexes. We also designed a rather coarse schema, which further complicated ad-hoc querying, which is another thing non-database-nerds love.
What’s that mean for me?
Cassandra has a lot of strengths. Once you get to a scale where you’re running data through a replicated database setup and some kind of key-value database or cache, it makes sense to start thinking about Cassandra. There are a lot of things you can do with it, and it lets you cheat in interesting ways. Take some extra time to think about the data model you build and how you’ll change it in the future. Like anything else, build tools for yourself to automate the things you do repeatedly.
Don’t use it because you read a blog post about it. Use it because it fits your application and your team is excited about using it.
Pass interference: can't live with it, can't live without it.
Bill Barnwell on revamping defensive penalties. Pass interference is tough business in the NFL. It's one of the easiest calls to get wrong on the field (besides the myriad of missed holding calls), but the easiest to fix with a slow-motion camera. It's too easy for both sides to game it as well. There are some good ideas in here, but I think just making pass interference calls and non-calls reviewable is a simple first step.
The pitfalls of growing a team
Premature Ramp-up, Martin Fowler on the perils of building up a development team too quickly: loss of code cohesion, breakdown of communication, plus the business costs of on-boarding. The problem I'm more concerned with, when growing a software team, is maintaining culture.
Adding a new person to a team is a process of integrating the new person’s unique good qualities to the team’s existing culture. It’s critical to use their prior experiences to clean up the sharp edges of the existing team practice without accidentally integrating new sharp edges. It’s a careful balancing act of taking advantage of the beginner’s mind and cultural indoctrination. Both sides have to give and take.
If you grow too quickly, it’s very easy for this balancing act to get, well, out of balance. The new people are only indoctrinated and the team doesn’t learn, or the new people don’t understand the team and go about doing whatever they felt was successful at their previous gig.
It’s common to focus on the difficulty of recruiting a team, but finding a culture match and growing that culture is equally challenging, if not more so.
A food/software change metaphor
Are You Changing the Menu or the Food? Incremental change, the food metaphor edition. It's about software and startups. But food too. Think "software" when he says "food". Just read it, OK?
Your frienemy, the ORM
When modeling how our domain objects map to what is stored in a database, an object-relational mapper often comes into the picture. And then, the angst begins. Bad queries are generated, weird object models evolve, junk-drawer objects emerge, cohesion goes down and coupling goes up.
It’s not that ORMs are a smell. They are genuinely useful things that make it easier for developers to go from an idea to a working, deployable prototype. But it’s easy to fall into the habit of treating them as a top-level concern in our applications.
Maybe that is the problem!
What if our domain models weren’t built out from the ORM? Some have suggested treating the ORM, and the persistence of our objects themselves, as mere implementation details. What might that look like?
Hide the ORM like you’re ashamed of it
Recently, I had the need to build an API for logging the progress of a data migration as we ran it over many millions of records, spitting out several new records for every input record. Said log ended up living in PostgreSQL¹.
Visions of decoupled grandeur in my head, I decided that my API should not leak its databaseness out to its users. I started off trying to make the API talk directly to the PostgreSQL driver, but I wasn’t making much progress down that road. Further, I found myself reinventing things I would get for free in ActiveRecord-land.
Instead, I took a principled plunge. I surrendered to using an AR model, but I kept it tucked away inside the class for my API. My API makes several calls into the AR model, but it never leaks that ARness out to users of the API.
I liked how this ended up. I was free to use AR’s functionality within the inner model. I can vary the API and the AR model independently. I can stub out, or completely replace the model implementation. It feels like I’m doing OO right.
Enough of the suspense, let’s see a hypothetical example
A user model: everyone has a name, a city, and a URL. We can all do this in our sleep, right?
I start by defining an API. Note that all it knows is that there is some object called Model that it delegates to.
class User
  attr_accessor :name, :city, :url

  def self.fetch(key)
    Model.fetch(key)
  end

  def self.fetch_by_city(key)
    Model.fetch_by_city(key)
  end

  def save
    Model.create(name, city, url)
  end

  def ==(other)
    name == other.name && city == other.city && url == other.url
  end
end
That’s a pretty straightforward Ruby class, eh? The RSpec examples for it aren’t elaborate either.
describe User do
  let(:name) { "Shauna McFunky" }
  let(:city) { "Chasteville" }
  let(:url)  { "http://mcfunky.com" }

  let(:user) do
    User.new.tap do |u|
      u.name = name
      u.city = city
      u.url  = url
    end
  end

  it "has a name, city, and URL" do
    user.name.should eq(name)
    user.city.should eq(city)
    user.url.should eq(url)
  end

  it "saves itself to a row" do
    key = user.save
    User.fetch(key).should eq(user)
  end

  it "supports lookup by city" do
    user.save
    User.fetch_by_city(user.city).should eq(user)
  end
end
Not much coupling going on here either. Coding in a blog post is full of beautiful idealism, isn’t it?
“Needs more realism”, says the critic. Obliged:
class User::Model < ActiveRecord::Base
  set_table_name :users

  def self.create(name, city, url)
    # Return the new row's id so User.fetch(key) can round-trip it
    super(:name => name, :city => city, :url => url).id
  end

  def self.fetch(key)
    from_model(find(key))
  end

  def self.fetch_by_city(city)
    from_model(where(:city => city).first)
  end

  def self.from_model(model)
    User.new.tap do |u|
      u.name = model.name
      u.city = model.city
      u.url  = model.url
    end
  end
end
Here’s the first implementation of an actual access layer for my user model. It’s coupled to the user model by names, but it’s free to map those names to database tables, indexes, and queries as it sees fit. If I’m clever, I might write a shared example group for the behavior of whatever implements create, fetch, and fetch_by_city in User::Model, but I’ll leave that as an exercise for the reader.
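Not to spoil the exercise entirely, but a rough sketch of that shared example group might look something like this (RSpec 2 style, expectations of my own invention):

shared_examples_for "a User storage layer" do
  let(:storage) { described_class }

  it "round-trips a user through create and fetch" do
    key = storage.create("Shauna McFunky", "Chasteville", "http://mcfunky.com")
    storage.fetch(key).name.should eq("Shauna McFunky")
  end

  it "finds a user by city" do
    storage.create("Shauna McFunky", "Chasteville", "http://mcfunky.com")
    storage.fetch_by_city("Chasteville").name.should eq("Shauna McFunky")
  end
end

describe User::Model do
  it_behaves_like "a User storage layer"
end

Any implementation that passes those examples can slide in behind User without the specs for User itself changing.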
To hook my model up when I run RSpec, I add a moderately involved before hook:
before(:all) do
  ActiveRecord::Base.establish_connection(
    :adapter  => 'sqlite3',
    :database => ':memory:'
  )

  ActiveRecord::Schema.define do
    create_table :users do |t|
      t.string :name, :null => false
      t.string :city, :null => false
      t.string :url
    end
  end
end
As far as I know, this is about as simple as it gets to bootstrap ActiveRecord outside of a Rails test. So it goes.
Let’s fake that out
Now I’ve got a working implementation. Yay! However, it would be nice if I didn’t need all that ActiveRecord stuff when I’m running isolated unit tests. Because my model and data access layer are decoupled, I can totally do that. Hold on to your pants:
require 'active_support/core_ext/class'

class User::Model
  cattr_accessor :users
  cattr_accessor :users_by_city

  def self.init
    self.users = {}
    self.users_by_city = {}
  end

  def self.create(name, city, url)
    key = Time.now.tv_sec
    hsh = {:name => name, :city => city, :url => url}
    users[key] = hsh
    users_by_city[city] = hsh
    key
  end

  def self.fetch(key)
    attrs = users[key]
    from_attrs(attrs)
  end

  def self.fetch_by_city(city)
    attrs = users_by_city[city]
    from_attrs(attrs)
  end

  def self.from_attrs(attrs)
    User.new.tap do |u|
      u.name = attrs[:name]
      u.city = attrs[:city]
      u.url  = attrs[:url]
    end
  end
end
This “storage” layer is a bit more involved because I can’t lean on ActiveRecord to handle all the particulars for me. Specifically, I have to handle indexing the data in not one but two hashes. But it fits on one screen and it’s in memory, so I get fast tests without too much overhead.
This is a classic test fake. It’s not the real implementation of the object; it’s just enough for me to hack out tests that need to interact with the storage layer. It doesn’t tell me whether I’m doing anything wrong like a mock or stub might. It just gives me some behavior to collaborate with.
Switching my specs to use this fake is pretty darn easy. I just change my before hook to this:
before { User::Model.init }
Life is good.
Now for some overkill
Time passes. Specs are written, code is implemented to pass them. The application grows. Life is good.
Then one day the ops guy wakes up, finds the site going crazy slow, and sees that there are a couple hundred million users in the system. That’s a lot of rows. We’re gonna need a bigger database.
Migrating millions of rows to a new database is a pretty big headache. Even if it’s fancy and distributed. But, it turns out changing our code doesn’t have to tax our brains so much. Say, for example, we chose Cassandra:
require 'cassandra/0.7'
require 'active_support/core_ext/class'

class User::Model
  cattr_accessor :connection
  cattr_accessor :cf

  def self.create(name, city, url)
    generate_key.tap do |k|
      cols = {"name" => name, "city" => city, "url" => url}
      connection.insert(cf, k, cols)
    end
  end

  def self.generate_key
    SimpleUUID::UUID.new.to_guid
  end

  def self.fetch(key)
    cols = connection.get(cf, key)
    from_columns(cols)
  end

  def self.fetch_by_city(city)
    expression = connection.create_index_expression("city", city, "EQ")
    index_clause = connection.create_index_clause([expression])
    slices = connection.get_indexed_slices(cf, index_clause)
    cols = hash_from_slices(slices).values.first
    from_columns(cols)
  end

  def self.from_columns(cols)
    User.new.tap do |u|
      u.name = cols["name"]
      u.city = cols["city"]
      u.url  = cols["url"]
    end
  end

  def self.hash_from_slices(slices)
    slices.inject({}) do |hsh, (k, columns)|
      column_hash = columns.inject({}) do |inner, col|
        column = col.column
        inner.update(column.name => column.value)
      end
      hsh.update(k => column_hash)
    end
  end
end
Not nearly as simple as the ActiveRecord example. But sometimes it’s about making hard problems possible, even when they aren’t mindless retyping. In this case, I had to implement ID/key generation myself (Cassandra doesn’t do any of that). I also had to do some cleverness to generate an indexed query and then to convert the hashes that Cassandra returns into my User model.
But hey, look! I changed the whole underlying database without worrying too much about mucking with my domain models. I can dig that. Further, none of my specs need to know about Cassandra. I do need to test the interaction between Cassandra and the rest of my stack in an integration test, but that’s generally true of any kind of isolated testing.
This has all happened before and it will all happen again
None of this is new. Data access layers have been a thing for a long time. Maybe institutional memory and/or scars have prevented us from bringing them over from Smalltalk, Java, or C#.
I’m just sayin’, as you think about how to tease your system apart into decoupled, cohesive, easy-to-test units, you should pause and consider the idea that pushing all your persistence needs down into an object you later delegate to can make your future self think highly of your present self.
¹ This ended up being a big mistake. I could have saved myself some pain, and our ops team even more pain, if I’d done an honest back-of-the-napkin calculation and stepped back for a few minutes to figure out a better angle on storage.
How to listen to Stravinsky's Rite of Spring
Igor Stravinsky’s The Rite of Spring is an amazing piece of classical music. It’s one of the rare pieces that was truly revolutionary in its time. But in our time, almost one hundred years on, it doesn’t sound all that radical.
Music has moved on. We are used to the odd time signature of “Take Five” and the dissonant horns of a John Williams soundtrack. Music that offends the status quo is nothing new.
To enjoy Rite of Spring in its proper context, you have to forget all that. Put yourself in the shoes of a Parisian in 1913, probably well off. You probably just enjoyed a Monet and a coffee. But your world is changing. Something about workers revolting. A transition from manual labor to mechanical labor.
Now imagine yourself at the premiere of this new ballet from Russia. Being a Parisian, you’re probably expecting something along the lines of Debussy or Berlioz.
Instead, you get mild dissonance and then total chaos. The changing time signatures, the dissonance, the subject of virgin sacrifice. You’d probably riot too!
Practical words on mocking
Practical Mock Advice is practical:
“Coordinator objects like controllers are driven into existence because you need to hook two areas of your application together. In Rails applications, you usually drive the controller into existence using a Cucumber scenario or some other integration test. Most of the time controllers are straightforward and not very interesting: essentially a bunch of boilerplate code to wire your app together. In these cases the benefit of having isolated controller tests is very little, but the cost of creating and maintaining them can be high.”

Includes the standard description of how to use mocks with external services. But more interesting are his ideas and conclusions on when to mock, how to mock caching implementations, and how to mock controllers/presenters/coordinator objects.

“A general rule of thumb is this: If there are interesting characteristics or behaviors associated with a coordinator object and it is not well covered by another test, by all means add an isolated test around it and know that mocks can be very effective.”
Locking and how did I get here?
I've got a bunch of browser tabs open. This is unusual; I try to have zero open. Except right now. I'm digging into something. I'm spreading ephemeral papers around on my ephemeral desk and trying to make a concept not ephemeral, at least in my head.
It all started with locking. It's a hard concept, but some programs need it. In particular, applications running across multiple machines connected by imperfect software and unreliable networks need it. And this sort of thing ends up being difficult to get right.
I've poked around with this before. Reading the code of some libraries that implement locking in a way that might come in handy to me, I check out some documentation that I've seen referenced a couple times. Redis' setnx command can function as a useful primitive for implementing locks. It turns out getset is pretty interesting too. Ohm, redis-objects and adapter-redis all implement locking using a combination of those two primitives (sketched below). Then I start to dig deeper into Ohm; there's some interesting stuff here. Activity feeds with Ohm is relevant to my interests. I've got a thing for persistence tools that enumerate their philosophy. Nest seems like a useful set of concepts too.
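As an aside, the setnx/getset pattern those libraries build on looks roughly like this; a paraphrase of the algorithm from the Redis docs, using redis-rb and names of my own invention:

require 'redis'

# Acquire a lock, run the block, release. The value stored at the key is
# the time the lock expires, so a crashed holder can't wedge everyone forever.
def with_lock(redis, key = 'lock:thing', timeout = 10)
  loop do
    expires_at = Time.now.to_i + timeout + 1

    # setnx wins the lock outright if nobody holds the key
    break if redis.setnx(key, expires_at)

    # If the current holder's lease has lapsed, try to take over with
    # getset; if the value we swapped out was also expired, nobody beat
    # us to it and the lock is ours
    if redis.get(key).to_i < Time.now.to_i
      break if redis.getset(key, expires_at).to_i < Time.now.to_i
    end

    sleep 0.1
  end

  yield
ensure
  redis.del(key)
end

The unconditional del at the end is the known soft spot: if your work outlives the lease, you can delete someone else's lock, which is exactly the kind of subtlety that makes this problem fun.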
I'm mentally wandering here. Let's rewind back to what I'm really after: a way to do locking in Cassandra. There's a blog post I came across before on doing critical sections in Cassandra, but it uses ZooKeeper, so that's cheating. Then I get distracted by a thing on HBase vs. Cassandra and another perspective on Cassandra that mentions, but does not really focus on, locking.
And then, paydirt. A wiki page on locking in Cassandra. It may be a little rough, and might not even work, but it's worth playing with. Turns out it's an adaptation of an algorithm devised by Leslie Lamport for implementing locking with atomic primitives. It uses a bakery as an analogy. Neat.
Then I get really distracted again. I remember doozer, a distributed consensus gizmo developed by Blake Mizerany at Heroku. I get to reading its documentation and come across the protocol spec, which has an intriguing link to a manpage on the Plan 9 File Protocol. That somehow drives me to ponder serialization and read about TNetstrings.
At this point, my cup has overfloweth. I've got locking, distributed consensus, serialization, protocols, and philosophies all on my mind. Lots of fun intellectual fodder, but I'll get nowhere if I don't stick my nose into one of them exclusively and really try to figure out what it's about. So I do. Fin.
Refactor to modules, for great good
Got a class or model that’s getting a little too fat? Refactor to Modules. I’ve done this a few times lately, and I’ve always liked the results. Easier to test, easier to understand, smaller files. As long as you’ve got a tool like ctags to help you navigate between methods, there’s no indirection penalty either.
That said, I’ve seen code that is overmodule’d. But, that almost always goes along with odd callback structures that obscure the flow-of-control. As long as you stick to Ruby’s method lookup semantics, it’s smooth sailing.
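The mechanics are as simple as they sound. A contrived example, names all mine:

# checkins.rb: one cohesive slice of user behavior
module Checkins
  def checkin!(spot)
    checkins.create(:spot_id => spot.id)
  end

  def checked_in_at?(spot)
    checkins.where(:spot_id => spot.id).exists?
  end
end

# friendships.rb: another slice, in its own file
module Friendships
  def friends_with?(other)
    friendships.where(:friend_id => other.id).exists?
  end
end

# user.rb: the model itself reads like a table of contents
class User < ActiveRecord::Base
  has_many :checkins
  has_many :friendships

  include Checkins
  include Friendships
end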
ZeroMQ inproc implies one context
I’ve been tinkering with ZeroMQ a bit lately. Abstracting sockets like this is a great idea. However, the Ruby library, like sockets in general, is a bit light on guidance, and the error messages aren’t of the form “Hey dummy, you do it in this order!”
Here’s something that tripped me up today. ZeroMQ puts everything into a context. If you’re doing in-process communication (e.g. between two threads in Ruby 1.9), you need to share that context.
Doing it right:
# Create a context for all in-process communication
>> ctx = ZMQ::Context.new
# Set up a request socket (think of this as the client)
>> req = ctx.socket(ZMQ::REQ)
# Set up a reply socket (think of this as the server)
>> rep = ctx.socket(ZMQ::REP)
# Like a server, the reply socket binds
>> rep.bind('inproc://127.0.0.1')
# Like a client, the request socket connects
>> req.connect('inproc://127.0.0.1')
# ZeroMQ only knows about strings
>> req.send('1')
=> true
# Reply/server side got the message
>> p rep.recv
"1"
=> "1"
# Reply/server side sends response
>> rep.send("urf!")
=> true
# Request/client side got the response
>> req.recv
=> "urf!"
Doing it wrong:
# Create a second context
>> ctx2 = ZMQ::Context.new(2)
# Create another client
>> req2 = ctx2.socket(ZMQ::REQ)
# Attempt to connect to a reply socket, but it doesn't
# exist in this context
>> req2.connect('inproc://127.0.0.1')
RuntimeError: Connection refused
from (irb):16:in `connect'
from (irb):16
from /Users/adam/.rvm/rubies/ruby-1.9.2-p180/bin/irb:16:in `'
I believe what is happening here is that each ZMQ::Context gets a thread pool to manage message traffic. In the case of in-process messages, the threads only know about each other within the confines of a context.
And now you know, roughly speaking.
Booting your project, no longer a giant pain
So your app has a few dependencies. A database here, a queue there, maybe a cache. Running all that stuff before you start coding is a pain. Shutting it all down can prove even more tedious.
Out of nowhere, I find two solutions to this problem. takeup seems more streamlined: clone a script, write a YAML config. foreman is a gem that defines a Procfile format for declaring your project’s dependencies. Both handle all the particulars of starting your app up, shutting it down, etc.
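For reference, a Procfile is just process names mapped to commands, one per line. A hypothetical app with a web process, a worker, and a local Redis might declare:

web: bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
redis: redis-server ./config/redis.conf

Then foreman start brings the whole pile up with labeled, interleaved output, and a single Ctrl-C tears it all down.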
I haven’t tried either of these because, of course, they came out the same week I bit the bullet and wrote a shell script to automate this on my projects. But I’m very pleased that folks are scratching this itch, and I hope I’ll have no choice but to start using one when it reaches critical goodness.