I’m going to pick on Resque here, since it’s nominally great and widely used. You’re probably using it in your application right now. Unfortunately, I need to tell you something unsettling.
There’s an outside chance your application is dropping jobs. The Resque worker process pulls a job off the queue, turns it into an object, and passes it to your class for processing. This handles the simple case beautifully. But failure cases are important, if tedious, to consider.
What happens if there’s an exception in your processing code? There’s a plugin for that. What happens if you need to restart your Resque process while a job is processing? There’s a signal for that. What if, between taking a job off the queue and fully processing it, the whole server disappears due to a network partition, hardware failure, or the FBI “borrowing” your rack?
Honestly, you shouldn’t treat Resque, or even Redis, like you would a relational or dynamo-style database. Redis, like memcached, is designed as a cache. You put things in it, you can get it out, really fast. Redis, like memcached, rarely falls over. But if it does, the remediation steps are manual. Redis currently doesn’t have a good High Availability setup (it’s being contemplated).
Further, Resque assumes that clients will properly process every message they dequeue. This isn’t a bad assumption. Most systems work most of the time. But, if a Resque worker process fails, it’s not great. It will lose all of the message(s) held in memory, and the Redis instance that runs your Resque queues is none the wiser.
In my past usage of Resque, this isn’t that big of a deal. Most jobs aren’t business-critical. If the occasional image doesn’t get resized or a notification email doesn’t go out, life goes on. A little logging and data tweaking cures many ills.
But, some jobs are business-critical. They need stronger semantics than Resque provides. The processing of those jobs, the state of that processing, is part of our application’s logic. We need to model those jobs in our application and store that state somewhere we can trust.
I first became really aware of this problem, and a nice solution to it, listening to the Ruby Rogues podcast. Therein, one of the panelists advised everyone to model crucial processing as state machines. The jobs become the transitions from one model state to the next. You store the state alongside an appropriate model in your database. If a job should get dropped, it’s possible to scan the database for models that are in an inconsistent state and issue the job again.
Let’s work an example. For our imaginary application, comment notifications are really important. We want to make sure they get sent, come hell or high water. Here’s what our comment model looks like originally:
class Comment after_create :send_notification def send_notification Resque.enqueue(NotifyUser, self.user_id, self.id) end end
Now we’ll add a job to send that notification:
class NotifyUser @queue = :notify_user def self.perform(user_id, comment_id) # Load the user and comment, send a notification! end end
But, as I’ve pointed out with great drama, this code can drop jobs. Let’s throw that state machine in:
class Comment # We'll use acts-as-state-machine. It's a classic. include AASM # We store the state of sending this notification in the aptly named # `notification_state` column. AASM gives us predicate methods to see if this # model is in the `pending?` or `sent?` states and a `notification_sent!` # method to go from one state to the next. aasm :column => :notification_state do state :pending, :initial => true state :sent event :notification_sent do transitions :to => :sent, :from => [:pending] end end after_create :send_notification def send_notification Resque.enqueue(NotifyUser, self.user_id, self.id) end end
Our notification has two states: pending, and sent. Our web app creates it in the pending state. After the job finishes, it will put it in the sent state.
class NotifyUser @queue = :notify_user def self.perform(user_id, comment_id) user = User.find(user_id) comment = Comment.find(comment_id) user.notify_for(comment) # Notification success! Update the comment's state. comment.notification_sent! end end
This a good start for more reliably processing jobs. However, most jobs happen to handle the interaction between two systems. This notification is a great example. It integrates our application with a mail server or another service that handles our notifications. Talking to those things is probably something that isn’t tolerant to duplicate requests. If our process croaks between the time it tells the mail server to send and the time it updates the notification state in our database, we could accidentally process this notification twice. Back to square one?
Not quite. We can reduce our problem space once more by adding another state to our model.
class Comment include AASM aasm :column => :notification_state do state :pending, :initial => true state :sending # We have attempted to send a notification state :sent # The notification succeeded state :error # Something is amiss :( # This happens right before we attempt to send the notification event :notification_attempted do transitions :to => :sending, :from [:pending] end # We take this transition if an exception occurs event :notification_error do transitions :to => :error, :from => [:sending] end # When everything goes to plan, we take this transition event :notification_sent do transitions :to => :sent, :from => [:sending] end end after_create :send_notification def send_notification Resque.enqueue(NotifyUser, self.user_id, self.id) end end
Now, when our job processes a notification, it first uses
notification_attempted. Should this job fail, we’ll know which jobs we should look for in our logs. We could also get a little sneaky and monitor the number of jobs in this state if we think we’re bottlenecking around sending the actual notification. Once the job completes, we transition to the
sent state. If anything goes wrong, we catch the exception and put the job in the
error state. We definitely want to monitor this state and use the logs to figure out what went wrong, manually fix it, and perhaps write some code to fix bugs or add robustness.
sending state is entered when at least one worker has picked up a notification and tried to send the message. Should that worker fail in sending the message or updating the database, we will know. When trouble strikes, we’ll know we have two cases to deal with: notifications that haven’t been touched at all, and notifications that were attempted and may have succeeded. The former, we’ll handle by requeueing them. The latter, we’ll probably have to write a script to grep our mail logs and see if we successfully sent the message. (You are logging everything, centrally aggregating it, and know how to search it, right?)
The truth is, the integration points between systems is a gnarly problem. You don’t so much solve the problem; you get really good at detecting and responding to edge cases. Thus is life in production. But losing jobs, we can make really good progress on that. Don’t worry about your low-value jobs; start with the really important ones, and weave the state of those jobs into the rest of your application. Some day, you’ll thank yourself for doing that.