Hapi Thanksgiving: it is about systems of nodes scaling, not node scaling

Not being a native, I still haven't fully worked out Thanksgiving. It feels like a dry run for the holiday period, and being in retail, it has become game time.

On the day before our Super Bowl (Black Friday), I wanted to give my "thanks". Gratitude is something I can very much get into!

Last year Eran Hammer led the charge on #nodebf, live tweeting how the systems were handling the load. It was the first year of big node usage, and every year the systems get a larger workout.

Fast forward a year and we have multiple teams using node, and some of them are kindly joining the fun and live tweeting at #nodebf again!

We wanted to give a feel for what the systems do, and the planning behind dealing with the traffic, so we have a couple of write-ups for you:

Server Side JavaScript Rendering

The mobile website started off as a client-side SPA. However, for the sake of performance, initial load time, and SEO, we pushed on a server-side rendering solution. Now, instead of rendering in the client browser, we run node processes on a proxy in front of our customers to render the application, which the client then hijacks. To really raise the bar, Kevin Decker decided to hack the rendering pipeline so it can switch between async and sync modes on the client vs. the server. Here's what Kevin has to say about it all:
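The sync-on-server, async-on-client idea can be sketched roughly like this. This is a minimal illustration, not the actual fruit-loops/hula-hoop API; the function and data names are invented, and the assumption is that the server has the data warm in a local cache while the browser must fetch it over the network:

```javascript
// In Node there is no `window`, so this flag distinguishes the two environments.
const IS_SERVER = typeof window === 'undefined';

// One shared template, used by both the server pass and the client hydration.
function renderProduct(product) {
  return `<div class="product"><h1>${product.name}</h1><p>$${product.price}</p></div>`;
}

// One render entry point that resolves data synchronously on the server
// (blocking on a warm cache so the HTML ships in a single pass) and
// asynchronously in the browser (hydrating once the data arrives).
function render(fetchSync, fetchAsync, done) {
  if (IS_SERVER) {
    done(renderProduct(fetchSync()));
  } else {
    fetchAsync((product) => done(renderProduct(product)));
  }
}

// Example: on the server this logs the full HTML immediately.
render(
  () => ({ name: 'TV', price: 299 }),          // sync path: local cache lookup
  (cb) => cb({ name: 'TV', price: 299 }),      // async path: stand-in for a network call
  (html) => console.log(html)
);
```

The point of the shared entry point is that the same application code renders in both places; only the data-resolution strategy flips.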

“This year marks the first year that we are doing full scale rendering of our SPA application on our mobile.walmart.com Node.js tier, which has provided a number of challenges that are very different from the mostly IO-bound load of our prior #nodebf.

For all intents and purposes, the infrastructure outlined for last year is the same, but our Home, Item and a few other pages are prerendered on the server using fruit-loops and hula-hoop to execute an optimized version of our client-side JavaScript and provide a SEO- and first-load-friendly version of the site.

To support the additional CPU load concerns at peak, which we hope will be unfounded or mitigated by our work, we have also taken a variety of steps to increase the cache lifetimes of the pages that are being served in this manner. In order of their impact:"

Follow him, and read more to get the full scoop on the taming of the event loop.

Quimby, our three-legged stool

We think that the three-legged stool that brings together analytics, A/B testing, and client configuration is crucial.

Jason Pimcin talks about the new Quimby service that ties it all together to make life sing:

“Quimby is Walmart’s service layer for mobile clients’ configuration, CMS, A/B testing setup, and a few other sundry related services. It stitches together a constellation of data sources into a concise menu of API calls that mobile clients make to initialize and configure themselves.

Quimby is a REST service layer based upon the Gogo micro-service framework that we in turn built with Node.js, Hapi, Zookeeper, and Redis. Gogo is able to expose an array of web servers as a single host, and offers the ability to isolate tasks into smaller focused processes, emphasizing scalability and failure recovery. For example, a failure in any micro-service will not affect the life cycle of a request. Gogo also offers the additional features required to build distributed services with shared state, such as leader election.”
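Gogo's internals are not shown in the post, but leader election in Redis-backed systems commonly reduces to an expiring lock that only one contender can claim (the pattern behind Redis's `SET key value NX PX ttl`). The sketch below models that race with an in-memory stand-in for the store, purely to illustrate the idea; it is not Gogo's implementation:

```javascript
// In-memory stand-in for a Redis-style store supporting "set if not exists"
// with a millisecond expiry. Time is passed in explicitly to keep it testable.
function makeStore() {
  const data = new Map();
  return {
    // Returns true only for the first caller while the key is unexpired.
    setNX(key, value, ttlMs, now) {
      const current = data.get(key);
      if (current && current.expires > now) return false; // lock still held
      data.set(key, { value, expires: now + ttlMs });
      return true;
    },
  };
}

// A worker attempts to become leader by claiming the lock for 5 seconds.
// Exactly one of several racing workers wins; the rest retry after the TTL.
function electLeader(store, workerId, now) {
  return store.setNX('leader', workerId, 5000, now);
}
```

A real deployment would also have the leader refresh the lock before it expires, so leadership only changes hands when the leader actually dies.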

Scaling systems, not languages or platforms

We are fans of node, but we also have a ton of other systems powering Black Friday. They range beyond node to Java and Clojure, but there is one thing that ties the systems together: the people.

While we love to focus on the black and white of technology, and try to make outrageous claims like “X is the best way to scale!”, it isn’t about “scaling node”. Our best systems share other, more important patterns:

  • Measurement from the beginning: how can you know whether you can scale for a big event without measuring your systems?
  • Your test systems should be as close to production as possible (otherwise the measurements are weak).
  • The architecture must enable scalable performance, so there isn’t a cliff caused by a brittle dependency.
  • Failover is baked in and assumed. Things will go wrong and break at any scale, so you have to assume failure and work around it. We have a fair few flags and configuration options to reroute manually, and a fair share of automated smarts!
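The manual-flags-plus-automation idea can be sketched as a small routing decision: an operator-set override wins outright, automated health checks pick among the remaining backends, and a static fallback catches the case where everything is down. The flag and backend names here are hypothetical:

```javascript
// Choose where to send traffic. `flags` is operator-controlled configuration;
// `health` is the latest result of automated health checks.
function pickBackend(flags, health) {
  // A manual override always wins: humans can reroute instantly during an event.
  if (flags.forceBackend) return flags.forceBackend;

  // Otherwise prefer the first healthy backend in priority order.
  const healthy = ['primary', 'secondary'].filter((b) => health[b]);

  // Assume failure happens: if nothing is healthy, serve a static fallback.
  return healthy[0] || 'static-fallback';
}
```

The important property is that the system degrades in steps rather than falling off a cliff: override, healthy primary, healthy secondary, then static content.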

As I settle in to watch the rampant Thanksgiving that is the calm before the Black Friday and Cyber Monday storm, I give thanks to the shoulders of the giants we stand on, and a huge thanks to my team. I love these engineers (and I don’t even mind the other folks ;)

Wish us luck!
