Your frienemy, the ORM

When modeling how our domain objects map to what is stored in a database, an object-relational mapper often comes into the picture. And then, the angst begins. Bad queries are generated, weird object models evolve, junk-drawer objects emerge, cohesion goes down and coupling goes up.

It’s not that ORMs are a smell. They are genuinely useful things that make it easier for developers to go from an idea to a working, deployable prototype. But it’s easy to fall into the habit of treating them as a top-level concern in our applications.

Maybe that is the problem!

What if our domain models weren’t built out from the ORM? Some have suggested treating the ORM, and the persistence of our objects themselves, as mere implementation details. What might that look like?

Hide the ORM like you’re ashamed of it

Recently, I had the need to build an API for logging the progress of a data migration as we ran it over many million records, spitting out several new records for every input record. Said log ended up living in PostgreSQL [1].

Visions of decoupled grandeur in my head, I decided that my API should not leak its databaseness out to the user. I started off trying to make the API talk directly to the PostgreSQL driver, but I wasn’t making much progress down that road. Further, I found myself reinventing things I would get for free in ActiveRecord-land.

Instead, I took a principled plunge. I surrendered to using an AR model, but I kept it tucked away inside the class for my API. My API makes several calls into the AR model, but it never leaks that ARness out to users of the API.

I liked how this ended up. I was free to use AR’s functionality within the inner model. I can vary the API and the AR model independently. I can stub out, or completely replace the model implementation. It feels like I’m doing OO right.

Enough of the suspense, let’s see a hypothetical example

Take a user model. Everyone has a name, a city, and a URL. We can all do this in our sleep, right?

I start by defining an API. Note that all it knows is that there is some object called Model that it delegates to.

class User
  attr_accessor :name, :city, :url

  def self.fetch(key)
    Model.fetch(key)
  end

  def self.fetch_by_city(key)
    Model.fetch_by_city(key)
  end

  def save
    Model.create(name, city, url)
  end

  def ==(other)
    name == other.name && city == other.city && url == other.url
  end

end

That’s a pretty straightforward Ruby class, eh? The RSpec examples for it aren’t elaborate either.

describe User do

  let(:name) { "Shauna McFunky" }
  let(:city) { "Chasteville" }
  let(:url) { "http://mcfunky.com" }

  let(:user) do
    User.new.tap do |u|
      u.name = name
      u.city = city
      u.url = url
    end
  end

  it "has a name, city, and URL" do
    user.name.should eq(name)
    user.city.should eq(city)
    user.url.should eq(url)
  end

  it "saves itself to a row" do
    key = user.save
    User.fetch(key).should eq(user)
  end

  it "supports lookup by city" do
    user.save
    User.fetch_by_city(user.city).should eq(user)
  end

end

Not much coupling going on here either. Coding in a blog post is full of beautiful idealism, isn’t it?

“Needs more realism”, says the critic. Obliged:

  class User::Model < ActiveRecord::Base
    self.table_name = "users"

    # Returns the new row's primary key so callers can pass it to fetch.
    def self.create(name, city, url)
      super(:name => name, :city => city, :url => url).id
    end

    def self.fetch(key)
      from_model(find(key))
    end

    def self.fetch_by_city(city)
      from_model(where(:city => city).first)
    end

    def self.from_model(model)
      User.new.tap do |u|
        u.name = model.name
        u.city = model.city
        u.url = model.url
      end
    end

  end

Here’s the first implementation of an actual access layer for my user model. It’s coupled to the actual user model by names, but it’s free to map those names to database tables, indexes, and queries as it sees fit. If I’m clever, I might write a shared example group for the behavior of whatever implements create, fetch, and fetch_by_city in User::Model, but I’ll leave that as an exercise for the reader.
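For anyone who wants a head start on that exercise, here is a rough, plain-Ruby sketch of the idea rather than a real RSpec shared example group. Everything in it is an assumption made for illustration: `check_storage_contract` and `MemoryBackend` are hypothetical names, and the `User` class is a trimmed copy of the one above so the snippet runs on its own.

```ruby
# Trimmed stand-in for the User class above so the sketch is self-contained.
class User
  attr_accessor :name, :city, :url

  def ==(other)
    name == other.name && city == other.city && url == other.url
  end
end

# Hypothetical in-memory backend. Any object answering create, fetch, and
# fetch_by_city could be handed to the check instead.
class MemoryBackend
  def initialize
    @rows = {}
  end

  def create(name, city, url)
    key = @rows.size + 1
    @rows[key] = { :name => name, :city => city, :url => url }
    key
  end

  def fetch(key)
    build(@rows.fetch(key))
  end

  def fetch_by_city(city)
    attrs = @rows.values.find { |row| row[:city] == city }
    attrs && build(attrs)
  end

  private

  def build(attrs)
    User.new.tap do |u|
      u.name = attrs[:name]
      u.city = attrs[:city]
      u.url  = attrs[:url]
    end
  end
end

# Hypothetical helper: exercises the three-method contract and returns a list
# of failure messages (empty when the backend behaves as expected).
def check_storage_contract(backend)
  failures = []
  expected = User.new.tap do |u|
    u.name = "Shauna McFunky"
    u.city = "Chasteville"
    u.url  = "http://mcfunky.com"
  end

  key = backend.create(expected.name, expected.city, expected.url)
  failures << "fetch did not round-trip" unless backend.fetch(key) == expected
  failures << "fetch_by_city did not find the user" unless backend.fetch_by_city(expected.city) == expected
  failures
end

failures = check_storage_contract(MemoryBackend.new)
```

The same check could be pointed at each storage implementation in turn, which is roughly what a shared example group would buy you inside RSpec.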

To hook my model up when I run RSpec, I add a moderately involved before hook:

  before(:all) do
    ActiveRecord::Base.establish_connection(
      :adapter => 'sqlite3',
      :database => ':memory:'
    )

    ActiveRecord::Schema.define do
      create_table :users do |t|
        t.string :name, :null => false
        t.string :city, :null => false
        t.string :url
      end
    end
  end

As far as I know, this is about as simple as it gets to bootstrap ActiveRecord outside of a Rails test. So it goes.

Let’s fake that out

Now I’ve got a working implementation. Yay! However, it would be nice if I didn’t need all that ActiveRecord stuff when I’m running isolated unit tests. Because my model and data access layer are decoupled, I can totally do that. Hold on to your pants:

require 'active_support/core_ext/class'

class User::Model
  cattr_accessor :users
  cattr_accessor :users_by_city

  def self.init
    self.users = {}
    self.users_by_city = {}
  end

  def self.create(name, city, url)
    key = users.size + 1  # sequential key; a timestamp would collide for two creates in the same second
    hsh = {:name => name, :city => city, :url => url}
    users[key] = hsh
    users_by_city[city] = hsh
    key
  end

  def self.fetch(key)
    attrs = users[key]
    from_attrs(attrs)
  end

  def self.fetch_by_city(city)
    attrs = users_by_city[city]
    from_attrs(attrs)
  end

  def self.from_attrs(attrs)
    User.new.tap do |u|
      u.name = attrs[:name]
      u.city = attrs[:city]
      u.url = attrs[:url]
    end
  end

end

This “storage” layer is a bit more involved because I can’t lean on ActiveRecord to handle all the particulars for me. Specifically, I have to handle indexing the data in not one but two hashes. But, it fits on one screen and it’s in memory, so I get fast tests at not too much overhead.

This is a classic test fake. It’s not the real implementation of the object; it’s just enough for me to hack out tests that need to interact with the storage layer. It doesn’t tell me whether I’m doing anything wrong like a mock or stub might. It just gives me some behavior to collaborate with.
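To make the fake-versus-mock distinction concrete, here is a small sketch contrasting the two. Both classes are hypothetical and pared down for illustration: the fake really stores what it is given, while the hand-rolled mock stores nothing and only records calls so a test can complain about the interaction afterwards.

```ruby
# Hypothetical fake: a working, simplified implementation.
class FakeStore
  def initialize
    @rows = {}
  end

  def create(name, city, url)
    key = @rows.size + 1
    @rows[key] = [name, city, url]
    key
  end

  def fetch(key)
    @rows[key]
  end
end

# Hypothetical hand-rolled mock: keeps no data, records calls, and can
# verify that the expected interaction happened.
class MockStore
  attr_reader :calls

  def initialize
    @calls = []
  end

  def create(*args)
    @calls << [:create, args]
    nil
  end

  def verify_created_once!
    count = @calls.count { |name, _args| name == :create }
    raise "expected create once, got #{count}" unless count == 1
    true
  end
end

# The fake gives collaborators real behavior to work with.
fake = FakeStore.new
key = fake.create("Shauna McFunky", "Chasteville", "http://mcfunky.com")
fetched = fake.fetch(key)

# The mock only tells us whether the interaction looked right.
mock = MockStore.new
mock.create("Shauna McFunky", "Chasteville", "http://mcfunky.com")
verified = mock.verify_created_once!
```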

Switching my specs to use this fake is pretty darn easy. I just change my before hook to this:

  before { User::Model.init }

Life is good.

Now for some overkill

Time passes. Specs are written, code is implemented to pass them. The application grows. Life is good.

Then one day the ops guy wakes up, finds the site going crazy slow, and sees that there are a couple hundred million users in the system. That’s a lot of rows. We’re gonna need a bigger database.

Migrating millions of rows to a new database is a pretty big headache. Even if it’s fancy and distributed. But, it turns out changing our code doesn’t have to tax our brains so much. Say, for example, we chose Cassandra:

require 'cassandra/0.7'
require 'active_support/core_ext/class'

class User::Model

  cattr_accessor :connection
  cattr_accessor :cf

  def self.create(name, city, url)
    generate_key.tap do |k|
      cols = {"name" => name, "city" => city, "url" => url}
      connection.insert(cf, k, cols)
    end
  end

  def self.generate_key
    SimpleUUID::UUID.new.to_guid
  end

  def self.fetch(key)
    cols = connection.get(cf, key)
    from_columns(cols)
  end

  def self.fetch_by_city(city)
    expression = connection.create_index_expression("city", city, "EQ")
    index_clause = connection.create_index_clause([expression])
    slices = connection.get_indexed_slices(cf, index_clause)
    cols = hash_from_slices(slices).values.first
    from_columns(cols)
  end

  def self.from_columns(cols)
    User.new.tap do |u|
      u.name = cols["name"]
      u.city = cols["city"]
      u.url = cols["url"]
    end
  end

  def self.hash_from_slices(slices)
    slices.inject({}) do |hsh, (k, columns)|
      column_hash = columns.inject({}) do |inner, col|
        column = col.column
        inner.update(column.name => column.value)
      end
      hsh.update(k => column_hash)
    end
  end
end

Not nearly as simple as the ActiveRecord example. But sometimes it’s about making hard problems possible even if they’re not mindless retyping. In this case, I had to implement ID/key generation for myself (Cassandra doesn’t implement any of that). I also had to do some cleverness to generate an indexed query and then to convert the hashes that Cassandra returns into my User model.
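The slice-to-hash conversion is the fiddliest part, so here it is in isolation. This sketch assumes, as the code above does, that the indexed query hands back row keys mapped to lists of wrappers that respond to `column`, which in turn responds to `name` and `value`. `Column` and `Wrapper` are hypothetical stand-ins built with Struct so the snippet runs without Cassandra; the real client objects may look different.

```ruby
# Hypothetical stand-ins for the client's column objects.
Column  = Struct.new(:name, :value)
Wrapper = Struct.new(:column)

# Same conversion as hash_from_slices above, written with each_with_object.
def hash_from_slices(slices)
  slices.each_with_object({}) do |(key, columns), rows|
    rows[key] = columns.each_with_object({}) do |wrapper, row|
      row[wrapper.column.name] = wrapper.column.value
    end
  end
end

# Made-up sample data in the assumed shape.
slices = {
  "key-1" => [
    Wrapper.new(Column.new("name", "Shauna McFunky")),
    Wrapper.new(Column.new("city", "Chasteville")),
    Wrapper.new(Column.new("url", "http://mcfunky.com"))
  ]
}

rows = hash_from_slices(slices)
cols = rows.values.first  # the plain hash that from_columns expects
```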

But hey, look! I changed the whole underlying database without worrying too much about mucking with my domain models. I can dig that. Further, none of my specs need to know about Cassandra. I do need to test the interaction between Cassandra and the rest of my stack in an integration test, but that’s generally true of any kind of isolated testing.
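The versions above swap implementations by defining a different User::Model each time. Another possible arrangement, sketched here purely as an assumption and not as the setup used in this post, is to give User a settable backend so the swap is a single assignment. `backend` and `HashBackend` are hypothetical names.

```ruby
class User
  attr_accessor :name, :city, :url

  class << self
    # Hypothetical setting: any object implementing create and fetch.
    attr_accessor :backend

    def fetch(key)
      backend.fetch(key)
    end
  end

  def save
    self.class.backend.create(name, city, url)
  end
end

# Hypothetical minimal in-memory backend.
class HashBackend
  def initialize
    @rows = {}
  end

  def create(name, city, url)
    key = @rows.size + 1
    @rows[key] = { :name => name, :city => city, :url => url }
    key
  end

  def fetch(key)
    attrs = @rows.fetch(key)
    User.new.tap do |u|
      u.name = attrs[:name]
      u.city = attrs[:city]
      u.url  = attrs[:url]
    end
  end
end

# Swapping storage is one assignment; User itself never changes.
User.backend = HashBackend.new

user = User.new
user.name = "Shauna McFunky"
user.city = "Chasteville"
user.url  = "http://mcfunky.com"

key = user.save
found = User.fetch(key)
```

The trade-off is a bit of global state on User in exchange for not caring which file got loaded first.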

This has all happened before and it will all happen again

None of this is new. Data access layers have been a thing for a long time. Maybe institutional memory and/or scars have prevented us from bringing them over from Smalltalk, Java, or C#.

I’m just sayin’, as you think about how to tease your system apart into decoupled, cohesive, easy-to-test units, you should pause and consider the idea that pushing all your persistence needs down into an object you later delegate to can make your future self think highly of your present self.

[1] This ended up being a big mistake. I could have saved myself some pain, and our ops team even more pain, if I’d done an honest back-of-the-napkin calculation and stepped back for a few minutes to figure out a better angle on storage.