Software Development

“I have never had a fight with my wife!” The Importance of Resilience

August 25, 2015

I overheard a conversation that you have probably heard a variant of yourself. A bloke was so proud of the fact that he hadn’t had a serious confrontation with his spouse to date. Ah, the perfect union.

While some joined him in appreciation, I had to hide the real thoughts going through my head:

“Oh crap, he may be lucky and truly have the perfect situation, or when something does come to a boil, they have never practiced the art of disagreement. They have never worked through a tricky situation.”

Resilient Software

This reminded me of a similar conversation that I had a while back, where I heard an admin act so very proud that one machine had been up for over a year. That scared the hell out of me too. It meant that a restart hadn’t been tested in over a year, and can you imagine the magic and cruft that had built up? That is one of the reasons why folks are excited about immutable servers, or at the very least having systems that get built up from scratch.

I am building a new application, and not only should it be mobile first, it should also be offline first. The majority of experiences should probably be architected in the same way. You notice the opposite these days, and often from applications that were built before the mobile revolution. It is often much harder to bolt on offline capability after the fact.

When you build an offline first client you tend to get some great side benefits:

If you are working on local data you can keep a responsive UI (as long as you are smart about keeping off the main thread!)
You can progressively enhance when online
For example, when Duolingo matches your typed-in answer, a simple match can occur locally while a more complex match can be kicked off online (if the client is online).
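
A minimal sketch of that local-first, enhance-when-online split. The names here (`normalize`, `check_answer`, `remote_match`) are hypothetical and chosen for illustration; this is not how any particular app implements it.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def check_answer(
    typed: str,
    expected: str,
    remote_match: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Check an answer locally first, then enhance with a remote check if one exists.

    `remote_match` is an assumed callable that stands in for a smarter
    server-side matcher. Pass None when the client is offline.
    """
    # Cheap local check: always available, even offline.
    if normalize(typed) == normalize(expected):
        return True
    # Richer check: only attempted when a remote matcher is available.
    if remote_match is not None:
        try:
            return remote_match(typed, expected)
        except ConnectionError:
            # The network dropped mid-call, so keep the local verdict.
            return False
    return False
```

The core experience (the local match) never depends on the network; the remote matcher can only upgrade the result.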

You can also get into a situation where you are out of sync. The client and server see differing versions of reality.

Shift+Reload

In retrospect we had it lucky in the Web 1.0 days. We could purely server render and the client was a dumb terminal that recreated itself on every request. As browsers got richer caching we needed to give users the nuclear option: Shift+Reload. Not exactly user friendly, but it sure came in handy (and still does!).

These days we need to make sure that our rich clients aren’t getting corrupted. For our clients to work well offline they have local state and data to work on. We have all been frustrated when there is a bug that you can’t easily recover from.

One example for me is installing applications on my devices. As I type I have gotten into the situation where my download is hung, yet I have no way to kick start it or even delete it. There is something so very infuriating when this happens, when a version of a shift+reload doesn’t fix the situation. Just yesterday a coworker and I created the same projects on Asana because we didn’t see that the other had already done so. It took forever for me to see his version, and be able to clean it up.

It is tough to get this right. We are trying to do the right thing for the user by caching and keeping their application responsive, yet we should take some hints and have systems to help out. If a user is killing and restarting their application, that is the modern shift+reload, is it not?

Micro Services

In theory the birth of micro services and trying to hide the complexity of functionality behind nice clean decoupled APIs helps us with resiliency. In practice I have seen this turn out to be a real mess. It isn’t the fault of the practice, but rather the implementation details. Here is what I have seen go wrong:

The scope of the services isn’t defined correctly

In one example the scope seemed to be a function of team size vs. the natural composition of the functionality

Inter-dependency killers

There were separate small services, but they all depended on each other. The result was a lot of communication around “a new version of service X was deployed and it broke service Y in QA”

No view of the system as a whole

When something goes wrong, how are you made aware? How do you then find where the problem is? Due to not having enough of a view on the whole system it can be hard to get this information. You have to explicitly spend time on the seams

Poor exception handling

I hate it when all errors and exceptions are treated as equal. This results in socket closed exceptions being thrown into the mix when they weren’t errors at all... the client just disconnected and it was fine! As soon as you get this wrong you get a sea of information that you can’t trust, and the killer errors can go unnoticed. I have seen shocking bugs live in production for far too long due to this :/
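
One way to keep the noise out, sketched under the assumption that a handful of built-in Python exception types mean "the client went away" rather than "we broke". The `BENIGN_DISCONNECTS` tuple and the `classify` and `report` helpers are hypothetical names; what belongs on the benign list is a per-service decision.

```python
import logging

logger = logging.getLogger("service")

# Assumption: these exception types signal a client that simply disconnected.
BENIGN_DISCONNECTS = (BrokenPipeError, ConnectionResetError, ConnectionAbortedError)


def classify(exc: BaseException) -> str:
    """Return "benign" for expected disconnects and "error" for everything else."""
    if isinstance(exc, BENIGN_DISCONNECTS):
        return "benign"
    return "error"


def report(exc: BaseException) -> str:
    """Log at a level that matches the classification and return that classification."""
    kind = classify(exc)
    if kind == "benign":
        # Worth counting, not worth paging anyone about.
        logger.info("client disconnected: %r", exc)
    else:
        logger.error("unhandled failure: %r", exc)
    return kind
```

With the two kinds separated, an alert on the error level stays trustworthy because routine disconnects never reach it.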

Finger pointing

The worst situations occur when you have constant finger pointing. Something is wrong in the system but each team is arguing about what is actually broken. Services teams point at each other, and point at the network guys, who point at the infrastructure guys, who... point back to the services folk!

Spending time up front to get ahead of this is vital. Certain platforms shine here too. Erlang is known for holding resiliency as its core tenet. Various reactive platforms do well, but although these can make life better for you, you need to care.

I have often had to hold my nose and do the impure. I have set up proxy layers that do automatic retries when the core backend should have been fixed. This is risky, because you can end up increasing the traffic and causing even more issues, but if done right it can save your bacon.
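
A hedged sketch of what "done right" might look like: a bounded number of attempts with exponential backoff and jitter, so that retries do not pile extra traffic onto a struggling backend. The function name, its parameters, and the choice to retry only on `ConnectionError` are all assumptions made for illustration.

```python
import random
import time


def call_with_retries(fn, max_attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying on ConnectionError with capped, jittered backoff.

    `sleep` is injectable so the waiting can be observed or skipped in tests.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except ConnectionError:
            if attempt >= max_attempts:
                # Give up and surface the failure rather than retrying forever.
                raise
            # The delay ceiling doubles with each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out clients that all failed together.
            sleep(random.uniform(0, ceiling))
```

The cap on attempts is what keeps this from amplifying an outage: once the budget is spent, the caller sees the real error.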

Have you gone through and spec’d the SLA needed for various services? So often we see a least common denominator when it is better to split things out. As an example, if you look at an API that gives you information on a product (description, price, availability, reviews, images, etc) you may want an up to date price but those reviews? Not so much. You can probably deal just fine without that one review that just came in. In this case you probably want to say the equivalent of:

“try to get the latest reviews, but if they don’t come back in Xms then use the last grabbed…. and when that call comes back update the cache for next time maybe, cool?”
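
That policy can be sketched with `asyncio`. This is an illustration of the idea only, not hapi's implementation; `StaleOnTimeoutCache`, `fetch`, and `timeout_s` are hypothetical names.

```python
import asyncio


class StaleOnTimeoutCache:
    """Serve fresh data if it arrives in time, otherwise the last good value."""

    def __init__(self, fetch, timeout_s):
        self._fetch = fetch          # assumed async callable: key -> value
        self._timeout_s = timeout_s  # the "X ms" budget, expressed in seconds
        self._last = {}              # last good value per key
        self._pending = {}           # in-flight fetch task per key

    async def get(self, key):
        task = self._pending.get(key)
        if task is None:
            task = asyncio.create_task(self._refresh(key))
            self._pending[key] = task
        try:
            # shield() lets the fetch keep running after we stop waiting for it.
            return await asyncio.wait_for(asyncio.shield(task), self._timeout_s)
        except asyncio.TimeoutError:
            if key in self._last:
                return self._last[key]  # too slow: serve the stale copy
            raise  # nothing cached yet, so the caller has to handle the timeout

    async def _refresh(self, key):
        try:
            value = await self._fetch(key)
            self._last[key] = value  # update the cache for next time
            return value
        finally:
            self._pending.pop(key, None)
```

A price lookup would likely get a tight policy with no stale fallback, while reviews can tolerate this looser one; the point is to choose per field rather than per API.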

It isn’t a surprise that hapi, which Eran Hammer and his team started with me at Walmart, does this pretty well thanks to a box for a cat, as well as for handling microservices in general.

For a great modern experience that is fast and works well for your users, chances are that you should:

Build an offline first client, but give it enough intelligence to be able to handle corruption and get back to a clean state of health, even with nuclear options
Build a services tier that assumes failure at each tier, and that can deal with that failure gracefully
Progressively enhance the experience on both the client and the server to make sure the core service always works, but that it can also turn on features and tweaks when available.

And as soon as you have something running, start taking your code to counselling so the system can get good at dealing with disagreements and disruption 😉

Delivering software on time is important, but not most important

July 14, 2015

https://twitter.com/kartar/status/619587592300969984

Reliable software delivery is welcome, and an ideal trait of a great product engineering team. Most would trade off a slightly slower pace of delivery to gain predictability.

This level of execution is tough to come by. The team needs to learn to work well together but that isn’t enough. Just as not all teams are equal, not all problems are equal either. You may be able to get into a predictable rhythm when it comes to estimating the time it will take to deliver a screen when the API is already stable, but if there are more unknowns (usually the case) it gets harder. And then there is true R&D. You can’t predict the unknown, so if a team doesn’t understand how they are going to solve a problem then your estimate could be wildly off.

The thing is: that is OK! This is how creative work happens!

One frustration I have with The Business wanting fixed deadlines is that they rarely appear to have time to understand any of the nuance, risk, and unknown. They want a date, even if it is a false sense of security and doesn’t represent reality. Tools such as LiquidPlanner that try to put in as much of the uncertainty as possible can help visualize this nuance. If you are giving an absolute amount of work then chances are you are very wrong. Favor ranges over absolutes and push to get people thinking in that way. Understand how any fuzzy prediction gets clearer as it gets closer (like the weather!).
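
A deliberately naive illustration of ranges over absolutes: give each task a (low, high) estimate in days and report the summed range rather than a single number. The task names and figures are invented for the example, and this is not a description of how LiquidPlanner or any other tool computes its ranges.

```python
def total_range(tasks):
    """Sum per-task (low, high) day estimates into one overall (low, high) range."""
    low = sum(lo for lo, _ in tasks.values())
    high = sum(hi for _, hi in tasks.values())
    return low, high


# Hypothetical estimates: the more unknowns, the wider the range.
tasks = {
    "screen on a stable API": (2, 3),
    "new sync endpoint": (3, 8),
    "offline conflict handling (R&D)": (5, 20),
}

low, high = total_range(tasks)
print(f"{low}-{high} days")  # prints "10-31 days"
```

Even this crude version makes it visible that the R&D item owns most of the spread, which is the conversation a single date hides.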

This is often a tough sell, which I have always found interesting given that 90% of the projects have all had the goal posts moved, often near the end. Teams can be scared to change the date because “we don’t want to be flaky” so they keep holding their breath and hoping that heroics or luck will save the day. Sometimes they do, but do you want to run your business (or life) that way?

I feel awful when an engineer saves the day through heroics as it means that I didn’t do my job. Great engineers have done this countless times in my teams, and I celebrate them whilst feeling the personal frustration.

We naturally want to keep commitments, and being good partners is very important, but transparency and doing the work as a team is more critical than keeping false views. Transparency allows for understanding and an easier changing of scope and incremental tweaks along the journey.

Then you get to one of the worst sins: shipping to hit the date. Teams persuade themselves that the risk is worth it, that it is “good enough”, and prioritize the commitment to partners over customers. This tends to burn you though, as what really gets remembered? The quality of the product. If it is buggy, or suffers downtime, the team will be scrambling and paying the price for some time. There will be pressure to ship something, especially if it has slipped a couple of times, but what is actually remembered is the product and how well it performs. That extra couple of weeks of testing and polish may be critical, so we shouldn’t take the talk of “MVP” as meaning “ship stuff that isn’t tested” (MVP is about the feature set, not the quality).

Some dates are more sacred than others. If you are in retail you understand that there is a bit of a difference in shipping functionality in October vs. February. You have to be ready for Black Friday and the holiday season. This means that your processes need to change accordingly. Not only do you need to account for those periods where the “dates can’t move” (and thus scope etc. has to), but you should also minimize the importance of dates at other times during the year to give the teams a freaking break.

I know that someone promised some feature to some team for a March 1st ship date. You know what, software happens, and the fact that it ships in April isn’t the end of the world.

Be proud of what you ship, and how the customers experience you, and prioritize that above politics. The politics will probably take care of themselves: when the product does ship and it does well, the stakeholders will appreciate it and will forget that it was a little late. The customers (and stakeholders) won’t forget the product that shipped on time but blew up in everyone’s face.

I have had this post in the queue for a while, but it felt like I should finish it up and post it on the day that NASA comes to Pluto 72 seconds ahead of schedule on a 9-year mission.

That is both impressive, and showcases the trade offs needed to get precision.

What is also interesting is that we can get the drone to Pluto, billions of miles away, but we can’t keep up the website that talks about it. Huh!

I will take a drink and tip my hat to the engineers at NASA as we watch game time at 5:36pm Pacific Time!

Habit Driven Development

June 30, 2015

I have happened upon the importance of habits recently. I have always heard “good habits are important”, but I never really embraced that in a thoughtful way.

That changed when I had a health crisis. I realized that I needed to make a change to get healthy. I needed to create new habits with respect to nutrition, exercise, and holistic health (including mental health, sleep, you name it!).

I would often set myself lofty goals, and then if it didn’t look like I would reach them, everything would fizzle out. By contrast, if I can do something incremental on a daily basis it sticks, given enough repetitions for the habit to kick in.

Software Habits

A lot of the changes that we have seen in software development revolve around habits too.

Agile

I tend to dislike proprietary terminology or One True Way. There has been a backlash towards Agile™ as many have seen it slip into dogma. As soon as you find that you have forgotten why you are doing something you are in trouble. The agile manifesto itself is a simple document that talks about values: favoring X over Y.

Some took this and created religion around them. There are fascinating spiritual questions, and long term values and learnings around how to live a good life. Certain folk managed to persuade people that their recipe is Truth whilst the thousands of other recipes are so wrong that you can end up in hell for believing them. Fortunately we don’t think that Scrum is an infallible document passed to man from God, so let’s keep trying new things and focusing on what works, what doesn’t, and why.

Atheists can also tend to pooh-pooh particular practices. Some are bizarre and even barbaric, but others make sense. Alain de Botton lays out this case well in his book, Religion for Atheists.

Let’s not make the same mistake and ignore the good things that have come from various methodologies out there. Let’s not avoid setting up our own habits because we are scared to “be as bad as them”. Checklists don’t stifle; they can save you from making mistakes and forgetting good knowledge, and they give your brain the time and space to noodle on the important problems at hand. Dogma occurs when you forget why you are doing something and refuse to change with new knowledge and learnings. Practices that do change upon reflection, including looking at first principles, are a great thing.

Continuous X

There are many tasks in development that we have managed to chunk up and allow us to do them more frequently. I am old enough to remember the bad days where people were scared to do a certain task because they couldn’t trust how it would go. At every release you would see people with crossed fingers, first for the release to get out well, and second to make sure that if a rollback was needed it actually worked. Today we release all the time and some are actually pulling off continuous deployments and ideally delivery.

We have seen the same thing elsewhere:

Writing and running tests
Creating and merging branches
Setting up infrastructure.

There is a huge pay off when you can trust your process. You can deliver higher quality product with less risk and a much improved pace over time to boot.

There are severe penalties for not catching things early. I remember the work at Palm where they found that: if you catch a bug on the same day it was introduced you can fix it in around an hour. If you catch that bug later it can take 24 times as long, and take a day. It is obvious why: your brain was right there so you don’t need to context shift, and the changes are few.

The core of a good process in my mind is simple:

What are the core values of the team
What practices map to those values
What habits will result in an improvement
Reflect and iterate.

A process that shows agility seeds the core values with the ability to change quickly because of the simple observation that those things that can’t evolve tend to die. This doesn’t mean that the process is fast. It may take effort to build agility into your software, but the bet is that it pays off and gives you the best chance of not getting stuck in the future.

The key to non-dogma is the retrospective. As long as you can revisit and try new things you can get to where you need to be. Just as with A/B testing, sometimes you can’t iterate to a better solution quickly enough. Iteration is great if your dart was pretty close to the mark but sometimes you should throw another dart.

What are your and your team’s habits? When was the last time you took a deep look at the why as well as the how?
