Not that kind of log

First, read all of this excellent distillation of distributed systems by Jay Kreps, The Log: What every software engineer should know about real-time data’s unifying abstraction. Now, consider this.

There’s a moment, when you’re building services and web APIs, when you think you’ve pretty much got it under control. You’ve got an endpoint for every query, a resource for every workflow. All the use-cases seem to be under control. And then, the question appears:

“How can I get access to all the updates to all the data? You know, for [REASONS].”

For APIs exposed to external developers over the web, this is where you’d reach for web hooks or PubSubHubbub. It’s not the best solution, but it works. If you’re building an internal system, you could use the same approaches, or…you could build a log.

No not that kind of log. An event log, like LinkedIn did with Kafka for their internal systems. Every time your data model changes, every create, update, or delete, you drop an event with all the metadata related to the change. The event goes into some kind of single-producer, multiple consumer queue. Then all the clients that want to know about all the changes to all the things can read events off the queue and do whatever it is they need to for those important REASONS.

If you find this intriguing, this is a lot like replication in database systems. Definitely read LinkedIn’s article on this, and definitely read up on how your database of choice handles replication. And if you’ve built this before and have a good answer to initially populating “replicas” of a database, let me know; I haven’t come up with anything better than “just rsync it”.