Resilience

I overheard a conversation that you have probably heard a variant of yourself. A bloke was so proud of the fact that he hadn’t had a serious confrontation with his spouse to date. Ah, the perfect union.

While some joined him in appreciation, I had to hide the real thoughts going through my head:

“Oh crap, he may be lucky and truly have the perfect situation, or when something does come to a boil, they have never practiced the art of disagreement. They have never worked though a tricky situation”

Resilient Software

This reminded me of a similar conversation that I had awhile back, where I heard an admin act so very proud that one machine had been up for over a year. That scared the hell out of me too. It meant that a restart hadn’t been tested in over a year, and can you imagine the magic and cruft that was built up? That is one of the reasons why folks are excited about immutable servers, or at the very least having systems that get built up from scratch.

I am building a new application, and not only should it be mobile first, it should also be offline first. The majority of experiences should probably be architected in the same way. You notice the opposite these days, and often from applications that were built before the mobile revolution. It is often much harder to bolt on offline capability after the fact.

When you build an offline first client you tend to get some great side benefits:

If you are working on local data you can keep a responsive UI (as long as you are smart about keeping off the main thread!)
You can progressively enhance when online
For example, when DuoLingo matches your typed in answer a simple match can occur locally while a more complex match can be kicked off online (if the client is offline).

You can also get into a situation where you are out of sync. The client and server see differing versions of reality.

Shift+Reload

In retrospect we had it lucky in the Web 1.0 days. We could purely server render and the client was a dumb terminal that recreated itself on every request. As browsers got richer caching we needed to give users the nuclear option: Shift+Reload. Not exactly user friendly, but it sure came in handy (and still does!).

These days we need to make sure that our rich clients aren’t getting corrupt. For our clients to work well offline they have local state and data to work on. We have all been frustrated when there is a bug that you can’t easily restore from.

One example for me is installing applications on my devices. As I type I have gotten into the situation where my download is hung, yet I have no way to kick start it or even delete it. There is something so very infuriating when this happens, when a version of a shift+reload doesn’t fix the situation. Just yesterday a coworker and I created the same projects on Asana because we didn’t see that the other had already done so. It took forever for me to see his version, and be able to clean it up.

It is tough to get this right. We are trying to do the right thing for the user by caching and keeping their application responsive, yet we should take some hints and have systems to help out. If a user is killing and restarting their application, that is the modern shift+reload is it not?

Micro Services

In theory the birth of micro services and trying to hide the complexity of functionality behind nice clean decoupled APIs helps us with resiliency. In practice I have seen this turn out to be a real mess. It isn’t the fault of the practice, but rather the implementation details. Here is what I have seen go wrong:

The scope of the services isn’t defined correctly

In one example the scope seemed to be a function of team size vs. the natural composition of the functionality

Inter-dependency killers

There were separate small services, but they all depended on each other. The result was a lot of communication around “a new version of service X was deployed and it broke service Y in QA”

No view of the system as awhole

When something goes wrong, how are you made aware? How do you then find where the problem is? Due to not having enough of a view on the whole system it can be hard to get this information. You have to explicitly spend time on the seams

Poor exception handling

I hate it when all errors and exceptions are treated as equal. This results in socket closed exceptions being thrown into the mix where they weren’t errors at all…. the client just disconnected and it was fine! As soon as you get this wrong you get a sea of information that you can’t trust, and the killer errors can go un-noticed. I have seen shocking bugs live in production for far too long due to this :/

Finger pointing

The worst situations occur when you have constant finger pointing. Something is wrong in the system but each team is arguing about what is actually broken. Services teams point at each other, and point at the network guys, who point at the infrastructure guys who …. point back to the services folk!

Spending time up front to get ahead of this is vital. Certain platforms shine here too. Erlang is known for holding resiliency as its core tenet. Various reactive platforms do well, but although these can make life better for you, you need to care.

I have often had to hold my nose and do the impure. I have setup proxy layers that do automatic retries when the core backend should have been fixed. This is risky, because you can end up increasing the traffic and causing even more issues, but if done right it can save your bacon.

Have you gone through and spec’d the SLA needed for various services? So often we see a least common denominator when it is better to split things out. As an example, if you look at an API that gives you information on a product (description, price, availability, reviews, images, etc) you may want an up to date price but those reviews? Not so much. You can probably deal just fine without that one review that just came in. In this case you probably want to say the equivalent of:

“try to get the latest reviews, but if they don’t come back in Xms then use the last grabbed…. and when that call comes back update the cache for next time maybe, cool?”

It isn’t a surprise that hapi, which Eran Hammer and his team started with me at Walmart, does this pretty well thanks to a box for a cat, as well for handling microservices in general.

For a great modern experience that is fast and works well for your users, chances are that you should:

Build an offline first client, but give it enough intelligence to be able to handle corruption and get back to a clean state of health, even with nuclear options
Build a services tier that assumes failure at each tier, and that can deal with that failure gracefully
Progressively enhance the experience on both the client and the server to make sure the core service always works, but that it can also turn on features and tweaks when available.

And as soon as you have something running, start taking your code to counselling so the system can get good at dealing with disagreements and disruption 😉