
Architecting for Systems at Scale

As good as statistical modeling can be, when you lack the data, your model turns into a guess.

By R. Leigh Hennig, Principal Network Architect for the Markley Group.

While speaking to a group of college students about my involvement with the public streaming of a high-profile event, I was asked by one intern how we could possibly plan for something whose demand requirements were unknown. The event would be televised, the first of its kind, and while its criticality could not be overstated, we had no idea what kind of demand the system would be expected to handle. Since this event hadn’t been done before, we lacked the data from which to build an accurate capacity model.

“Could you just provision way more capacity than you expected?” the intern asked.

If only it were that simple.

The problem, of course, is familiar to many of us in system and network design as well as among our DevOps teams. It’s a big part of what makes cloud computing attractive to organizations in the first place, and if you’ve heard of this word before, you may begin to sense where I’m going with this: elasticity.

As good as statistical modeling can be, when you lack the data, your ‘model’ turns into a ‘guess.’ What if the demand for your planned event blows your modeling out of the water, and you find yourself scrambling to provision the resources necessary to support 400,000 requests per second instead of the 200,000 you planned on? Users can’t access your site during a public launch, downloads fail, and your servers (if they’re even still capable of responding) start spitting out those dreaded HTTP 500 errors. Traffic crushes the available throughput, congestion arises, and packets are forever lost to the ether(net). Public trust is lost, excitement evaporates and, before you know it, you don’t need to worry about scale because there aren’t enough users to trouble your previously beleaguered servers.

Or, the opposite happens: you over-provision, and now have more capacity than you know what to do with, and resources sit idle while your cloud vendor happily gets fat off your expensive support contracts. There are worse problems to have, sure – just don’t tell your CFO that.

Allow me to posit a third problem: suppose we do have the data, but for whatever reason, there’s a drastic spike outside of our prediction band? Large retailers are familiar with this one, as they sometimes see it during flash sale events (such as Cyber Monday) that suddenly drive sharp increases in traffic. World events can also throw a wrench into the mix when they change human behavior en masse. England’s power grid can speak to this: half-time during the World Cup saw untold scores of citizens switching on the tea kettle, placing sudden demands on power systems.

How then do we answer the question of our bright and intrepid intern? If dropping ten times the resources we think we need isn’t a viable option, what do we do when the data isn’t there to make anything but a guess, or when anomalies break our modeling?

We think differently. We change the equation.

Instead of asking ourselves, “How do I support a theoretical?” we should ask, “How do I move to support a theoretical?” And that’s a very different calculus indeed – one that we can much more easily approach.

When working on hyper-scale infrastructure, I became accustomed to doing things differently. When you operate in this kind of environment, traditional thinking goes out the window. The questions we need to attack as a matter of routine in our day-to-day are never ones of how we support 10,000 or 100,000 or even 10,000,000 users. Instead, we need to ask ourselves how we would go from supporting n to n*x users. From there, all kinds of interesting and critical questions arise, and at the root of them is velocity.

This type of thinking – a retrospective that is intended to analyze, in exact and excruciating detail, what went wrong, why, and how it could be avoided going forward – can be applied to any environment. There simply needs to be a standard set of questions that gets addressed, no matter what the issue. One of those questions goes something like this: As a thought exercise, how could you reduce the time to resolution by half?

Let’s go back to our capacity question and the unknown demand. You should know what it looks like to provision and deploy a router in your network that will support some amount of throughput. In a more traditional data center environment, the following comes into play: rack/space/power/cooling availability, time to vendor engagement, shipping estimates, staff availability for rack and stack, code and configuration deployment estimates.

If you’re in a cloud environment, you may be looking at API call delays or provisioning times. Traffic engineering or route convergence is likely to be thrown in the mix. Whatever the particulars of your environment, these are answers you have, or can easily obtain.
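
To make that figure concrete, here is a minimal Python sketch that adds up the traditional data center stages listed above, assuming they run strictly one after another. Every number of days is a made-up placeholder, and the names Stage, total_lead_time, largest_stage and lead_time_if_halved are illustrative rather than taken from any real tool. Substitute your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    days: float  # elapsed time for this stage, in days


def total_lead_time(stages):
    """Total days when every stage runs strictly one after another."""
    return sum(stage.days for stage in stages)


def largest_stage(stages):
    """The single stage that contributes the most elapsed time."""
    return max(stages, key=lambda stage: stage.days)


def lead_time_if_halved(stages, name):
    """Total lead time if only the named stage took half as long."""
    return sum(s.days / 2 if s.name == name else s.days for s in stages)


# Hypothetical figures for illustration only; measure your own environment.
STAGES = [
    Stage("rack/space/power/cooling check", 2),
    Stage("vendor engagement", 5),
    Stage("shipping", 14),
    Stage("rack and stack", 3),
    Stage("code and configuration deployment", 1),
]

if __name__ == "__main__":
    total = total_lead_time(STAGES)
    worst = largest_stage(STAGES)
    print(f"total lead time: {total} days")
    print(f"largest stage: {worst.name} ({worst.days} days)")
    print(f"with {worst.name} halved: {lead_time_if_halved(STAGES, worst.name)} days")
```

With these placeholder numbers the total is 25 days, and halving shipping alone only brings it down to 18. In this example, at least, cutting the whole figure in half means attacking more than one stage or running stages in parallel.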

Do you have that figure? Do you know the time estimate, from the time someone says ‘jump’ to the time your feet are actually off the ground? Good. Now, how could you reduce the time to resolution by half? It’ll likely involve automation at some layer. Standardization not just in writing, but in actual practice. If your lead engineer kicks off to London on a vacation for six weeks, can someone without context walk in behind him and pick up where he left off? What about containerization of services (and I don’t just mean Kubernetes)? Are your vendor engagements and data center management staff managed by one monolithic department? Are you properly utilizing APIs, and can your load balancing stack be easily restriped upon the introduction of an nth device?
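
On the restriping question, one well-known way to keep redistribution cheap is rendezvous (highest-random-weight) hashing, where each flow goes to whichever device scores highest for it. The sketch below illustrates that technique as an assumption about what "easily restriped" could look like; it does not describe any particular load balancer, and the device names (lb-1 through lb-5) and flow keys are hypothetical.

```python
import hashlib


def _score(key: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(key, nodes):
    """Rendezvous hashing: the node with the highest score owns the key."""
    return max(nodes, key=lambda node: _score(key, node))


def moved_keys(keys, before, after):
    """Keys whose owner changes when the pool goes from `before` to `after`."""
    return [k for k in keys if pick_node(k, before) != pick_node(k, after)]


if __name__ == "__main__":
    keys = [f"flow-{i}" for i in range(10_000)]  # hypothetical flow identifiers
    pool = ["lb-1", "lb-2", "lb-3", "lb-4"]      # hypothetical devices
    grown = pool + ["lb-5"]                      # introduce one more device

    moved = moved_keys(keys, pool, grown)
    print(f"{len(moved)} of {len(keys)} flows moved")
    # Flows that move can only move to the new device; the rest stay put.
    print(all(pick_node(k, grown) == "lb-5" for k in moved))
```

Going from four devices to five should move roughly a fifth of the flows, all of them onto the new device. A naive "hash modulo device count" scheme, by contrast, reshuffles most flows whenever the count changes.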

How many users can you support today, and how many do you need to support tomorrow? Even if you did know, it doesn’t matter, because that’s the wrong question. How quickly can you go from n capacity to n*y capacity? That’s the key question you need to solve for.
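
As a rough sketch of that key question, the Python below estimates how long it takes to grow from n units to some multiple of n when capacity is added in batches. It assumes batches are provisioned one after another, each taking the same time; units_per_batch, hours_per_batch and all example numbers are hypothetical stand-ins for whatever your own environment measures.

```python
import math


def units_to_add(current_units, multiplier):
    """Extra units needed to reach current_units * multiplier, rounded up."""
    # round() absorbs floating-point noise such as 10 * 1.1 == 11.000000000000002
    target = math.ceil(round(current_units * multiplier, 9))
    return max(0, target - current_units)


def hours_to_scale(current_units, multiplier, units_per_batch, hours_per_batch):
    """Elapsed hours if batches are provisioned strictly one after another."""
    batches = math.ceil(units_to_add(current_units, multiplier) / units_per_batch)
    return batches * hours_per_batch


if __name__ == "__main__":
    # Hypothetical: 40 units today, doubling, 10 units per batch, 30 minutes per batch.
    print(hours_to_scale(40, 2, units_per_batch=10, hours_per_batch=0.5))
    # Doubling the batch size is one way to halve the answer.
    print(hours_to_scale(40, 2, units_per_batch=20, hours_per_batch=0.5))
```

In this toy model the two levers are the size of a batch and the time a batch takes, which is where automation and standardization tend to pay off.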

Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating.
