Dion Almaer

Software, Development, Products

Resilience

“I have never had a fight with my wife!” The Importance of Resilience

August 25, 2015


I overheard a conversation that you have probably heard a variant of yourself. A bloke was so proud of the fact that he hadn’t had a serious confrontation with his spouse to date. Ah, the perfect union.

While some joined him in appreciation, I had to hide the real thoughts going through my head:

“Oh crap, he may be lucky and truly have the perfect situation, or, when something does come to a boil, they will never have practiced the art of disagreement. They have never worked through a tricky situation.”

Resilient Software

This reminded me of a similar conversation that I had a while back, where I heard an admin act so very proud that one machine had been up for over a year. That scared the hell out of me too. It meant that a restart hadn’t been tested in over a year, and can you imagine the magic and cruft that had built up? That is one of the reasons why folks are excited about immutable servers, or at the very least having systems that get built up from scratch.

I am building a new application, and not only should it be mobile first, it should also be offline first. The majority of experiences should probably be architected in the same way. You notice the opposite these days, and often from applications that were built before the mobile revolution. It is often much harder to bolt on offline capability after the fact.

When you build an offline first client you tend to get some great side benefits:

  • If you are working on local data you can keep a responsive UI (as long as you are smart about keeping off the main thread!)
  • You can progressively enhance when online
  • For example, when Duolingo checks your typed-in answer, a simple match can occur locally while a more complex match can be kicked off on the server (if the client is online).
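
To make the progressive enhancement idea concrete, here is a minimal sketch of a local-first answer check. The names `normalize`, `check_answer`, and `remote_check` are made up for illustration and have nothing to do with how Duolingo actually works; the assumption is simply that the client always has a cheap local comparison and only sometimes has a richer server-side one.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def check_answer(
    typed: str,
    expected: str,
    remote_check: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the answer is accepted.

    The cheap local comparison always runs. The richer remote check
    (hypothetical) is only attempted when one is supplied, i.e. when the
    client believes it is online. A network failure is treated the same
    as being offline, so the local verdict stands.
    """
    if normalize(typed) == normalize(expected):
        return True
    if remote_check is None:
        return False
    try:
        return remote_check(typed, expected)
    except OSError:
        # ConnectionError and friends are OSError subclasses.
        return False
```

The core experience never depends on the network here: the remote call can only upgrade a "no" into a "yes", it can never block or break the local answer.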

You can also get into a situation where you are out of sync. The client and server see differing versions of reality.

Shift+Reload

In retrospect we had it lucky in the Web 1.0 days. We could purely server render and the client was a dumb terminal that recreated itself on every request. As browsers got richer caching we needed to give users the nuclear option: Shift+Reload. Not exactly user friendly, but it sure came in handy (and still does!).

These days we need to make sure that our rich clients aren’t getting corrupted. For our clients to work well offline they have local state and data to work on. We have all been frustrated when there is a bug that you can’t easily recover from.

Borken Downloads

One example for me is installing applications on my devices. As I type I have gotten into the situation where my download is hung, yet I have no way to kick start it or even delete it. There is something so very infuriating when this happens, when a version of a shift+reload doesn’t fix the situation. Just yesterday a coworker and I created the same projects on Asana because we didn’t see that the other had already done so. It took forever for me to see his version, and be able to clean it up.

It is tough to get this right. We are trying to do the right thing for the user by caching and keeping their application responsive, yet we should take some hints and have systems to help out. If a user is killing and restarting their application, that is the modern shift+reload, is it not?
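
One way to take that hint, sketched under some assumptions: count how many times the app has been launched recently, and if launches cluster together, treat it as the user reaching for the nuclear option. `RestartWatcher` and its thresholds are hypothetical, and a real client would need to persist the launch times somewhere that survives a restart.

```python
from collections import deque
from typing import Deque


class RestartWatcher:
    """Flags a likely "modern shift+reload" when launches cluster together."""

    def __init__(self, max_launches: int = 3, window_seconds: float = 60.0) -> None:
        self.max_launches = max_launches
        self.window_seconds = window_seconds
        self._launches: Deque[float] = deque()

    def record_launch(self, now: float) -> bool:
        """Record a launch at time `now` (seconds); True means a reset is advised."""
        self._launches.append(now)
        # Forget launches that have fallen out of the sliding window.
        while self._launches and now - self._launches[0] > self.window_seconds:
            self._launches.popleft()
        return len(self._launches) >= self.max_launches
```

When `record_launch` returns `True`, the client could drop its caches and resync from the server rather than trusting local state that the user clearly isn't happy with.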

Micro Services

In theory the birth of micro services and trying to hide the complexity of functionality behind nice clean decoupled APIs helps us with resiliency. In practice I have seen this turn out to be a real mess. It isn’t the fault of the practice, but rather the implementation details. Here is what I have seen go wrong:

The scope of the services isn’t defined correctly

  • In one example the scope seemed to be a function of team size vs. the natural composition of the functionality

Inter-dependency killers

  • There were separate small services, but they all depended on each other. The result was a lot of communication around “a new version of service X was deployed and it broke service Y in QA”

No view of the system as a whole

  • When something goes wrong, how are you made aware? How do you then find where the problem is? Due to not having enough of a view on the whole system it can be hard to get this information. You have to explicitly spend time on the seams

Poor exception handling

  • I hate it when all errors and exceptions are treated as equal. This results in socket closed exceptions being thrown into the mix when they weren’t errors at all… the client just disconnected and it was fine! As soon as you get this wrong you get a sea of information that you can’t trust, and the killer errors can go unnoticed. I have seen shocking bugs live in production for far too long due to this :/
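
A small sketch of what "not treating all exceptions as equal" can look like. The mapping below is an assumption for illustration (your own service will have its own idea of what is benign); the point is only that a client hanging up gets logged quietly while everything unexpected stays loud.

```python
import logging

logger = logging.getLogger("service")

# Exceptions that usually mean "the client went away", not "we are broken".
BENIGN_DISCONNECTS = (BrokenPipeError, ConnectionResetError, ConnectionAbortedError)


def severity_for(exc: BaseException) -> int:
    """Map an exception to a logging level instead of treating all as errors."""
    if isinstance(exc, BENIGN_DISCONNECTS):
        return logging.INFO
    if isinstance(exc, TimeoutError):
        return logging.WARNING
    return logging.ERROR


def report(exc: BaseException) -> int:
    """Log the exception at its mapped level and return that level."""
    level = severity_for(exc)
    logger.log(level, "%s: %s", type(exc).__name__, exc)
    return level
```

With something like this in place, an alert on `ERROR` means something again, because closed sockets are no longer drowning it out.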

Finger pointing

  • The worst situations occur when you have constant finger pointing. Something is wrong in the system but each team is arguing about what is actually broken. Services teams point at each other, and point at the network guys, who point at the infrastructure guys, who… point back to the services folk!

Spending time up front to get ahead of this is vital. Certain platforms shine here too. Erlang is known for holding resiliency as its core tenet. Various reactive platforms do well, but although these can make life better for you, you need to care.

I have often had to hold my nose and do the impure. I have set up proxy layers that do automatic retries when the core backend should have been fixed. This is risky, because you can end up increasing the traffic and causing even more issues, but if done right it can save your bacon.
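
Here is roughly what "done right" might mean, as a sketch rather than the proxy I actually ran: cap the number of attempts, and back off with jitter so that retries don't pile onto a backend that is already struggling. `call_with_retries` and its parameters are hypothetical names, and the choice to retry only on `ConnectionError` is an assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on ConnectionError a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure, don't hide it.
            # Exponential backoff with full jitter, so many clients
            # retrying at once don't hit the backend in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. The hard cap on attempts is what stops a bad day from turning into a self-inflicted traffic spike.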

Have you gone through and spec’d the SLA needed for various services? So often we see a least common denominator when it is better to split things out. As an example, if you look at an API that gives you information on a product (description, price, availability, reviews, images, etc.) you may want an up to date price but those reviews? Not so much. You can probably deal just fine without that one review that just came in. In this case you probably want to say the equivalent of:

“try to get the latest reviews, but if they don’t come back in Xms then use the last grabbed…. and when that call comes back update the cache for next time maybe, cool?”
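
That sentence can be sketched in a few lines. This is not how hapi does it; `StaleOnTimeoutCache`, `get`, and `cached` are hypothetical names, and the sketch assumes a thread pool is an acceptable way to let the slow call finish in the background. It also falls back to the last value when the call fails outright, which is an extra assumption on top of the timeout case.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Hashable, Optional


class StaleOnTimeoutCache:
    """Try for a fresh value; on timeout or failure use the last one seen."""

    def __init__(self, timeout_seconds: float, workers: int = 4) -> None:
        self.timeout_seconds = timeout_seconds
        self._values: Dict[Hashable, Any] = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def cached(self, key: Hashable) -> Optional[Any]:
        """Return the last stored value for `key`, or None if there is none."""
        with self._lock:
            return self._values.get(key)

    def _store(self, key: Hashable, future: Future) -> None:
        # Runs whenever the fetch finishes, even long after get() gave up.
        if future.exception() is None:
            with self._lock:
                self._values[key] = future.result()

    def get(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        future = self._pool.submit(fetch)
        future.add_done_callback(lambda f: self._store(key, f))
        try:
            value = future.result(timeout=self.timeout_seconds)
        except Exception:
            # Too slow (or failed): use the last grabbed value if we have one.
            with self._lock:
                if key in self._values:
                    return self._values[key]
            raise
        with self._lock:
            self._values[key] = value
        return value
```

You would give the reviews a cache like this with a short timeout, and fetch the price directly with no stale fallback at all. That is the "split things out" part: each piece of data gets the freshness it actually needs.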

It isn’t a surprise that hapi, which Eran Hammer and his team started with me at Walmart, does this pretty well thanks to a box for a cat (its catbox caching layer), as well as handling microservices in general.


For a great modern experience that is fast and works well for your users, chances are that you should:

  • Build an offline first client, but give it enough intelligence to be able to handle corruption and get back to a clean state of health, even with nuclear options
  • Build a services tier that assumes failure at each tier, and that can deal with that failure gracefully
  • Progressively enhance the experience on both the client and the server to make sure the core service always works, but that it can also turn on features and tweaks when available.

And as soon as you have something running, start taking your code to counselling so the system can get good at dealing with disagreements and disruption 😉
