Aliens ate my program's state

So, Adam from a few years ago, you think you can build a distributed system?

Designing a concurrent system, one that runs across multiple processors using shared memory threads, is difficult because someone can move your cheese. One moment you’ve got food on your plate, and the next moment it’s on someone else’s plate.

Designing distributed systems, one that runs across multiple computers over networks that occasionally do weird things, is difficult because you can’t assume that the message will get through. One moment you’re telling your friend about an awesome new album and the next moment they don’t even know who you are anymore.

In both scenarios, you can’t assume the simplest case. You have to discard Occam’s razor; it’s entirely possible a perfectly devious alien is tinkering with your system, swallowing messages or preventing threads from running. For every operation in your system, you have to ask yourself what could go wrong and how will my system deal with that?

The first jarring thing about these sorts of problems is that you’re probably not using a framework. There are libraries to help you with logging, instrumentation, storage, coordination, and low-level primitives. There aren’t a lot of well-curated collections of opinions expressed as code. There’s no Rails, no Play, no Django, no Cocoa. There’s hardly even a Sinatra, a Celery, a JDBC.

That means you’re going to be designing, for real. Drawing on whiteboard, building prototypes and proofs of concepts. Getting feedback from as many smart people as you can corner. Questioning your assumptions, reading everything you can find on the topic. Looking for tradeoffs that decrease risk and simplify your problem domain.

Accept that you’re unlikely to ever get it entirely right. Those who are really good at it are more like Scotty than Spock. They’re surrounded by levers and measurements, tweaking and rebuilding the system as they go. It’s a fun puzzle!