Part 3 of a series (Part 1, Part 2)
HERE WE GO AGAIN
I will look at this from one more angle and then I will let it rest here for future reference, since pretty much everyone else seems, not surprisingly, to have moved on. With the aside on how we go about discussing this topic out of the way (and various other digressions) in my post last Sunday, I wanted to focus a bit on what is perhaps the center of the argument used in the Times article. At the very least, elaborating on the flaws the center should get us very close to exposing the feebleness of the rest of the argument’s construction.
At the core of the argument is the following paragraph:
Energy efficiency varies widely from company to company. But at the request of The Times, the consulting firm McKinsey & Company analyzed energy use by data centers and found that, on average, they were using only 6 percent to 12 percent of the electricity powering their servers to perform computations. The rest was essentially used to keep servers idling and ready in case of a surge in activity that could slow or crash their operations.
In my response I took issue with this paragraph in two specific areas: 1) that an average is meaningless without more information (e.g. the standard deviation, for starters) and 2) that the measure of utilization they imply, and I say imply because they never make it clear –another flaw–, was one of “performing computations.” I elaborated a bit on the various types of tasks servers may be performing and noted that you couldn’t amalgamate them all into a single value. This is true, but I want to step back a bit into what utilization means and the unmentioned factor that is on the other side of it: efficiency.
WHERE IT BECOMES CLEAR THAT TERMINOLOGY MATTERS (A LOT)
Semantics time: we need to define some terms. Let’s say, just for a moment, to simplify the discussion a bit, that we’re ok talking about utilization as some kind of aggregate. Let’s say that we ignore the specifics of what the servers are doing and we further assume that we will use a percentage of utilization of the “system,” to some percent between 0 and 100, with zero being the system is doing nothing but its own minimal housekeeping tasks, so zero isn’t really zero, but we’re simplifying, and 100 being the system is fully utilized doing something specifically related to the application at hand. I should start though, with some terminology housekeeping by defining what “a system” is.
Definition 0: I will use the words system and machine interchangeably to mean a particular piece of hardware or virtual instance, typically a server, running a particular piece of software. Mixing up virtual systems and machines with actual hardware is a bit of a shortcut, but since in practice virtual systems must eventually map to a real one, it’s one that I think we can live with. (Another shortcut lies in “typically a server” since, say, network switches should also be part of the equation.)
Definition 1: Utilization: a percentage between 0 and 100 of system load related to the specific application tasks at hand.
I shudder at the oversimplification, but I’ll get over it. Probably.
Now, related to utilization is efficiency. While utilization can be said to be an objective concept that is measurable, efficiency can only really be understood relative to something else, and in the case of the a piece of software as relative to previous versions itself. So for example we can’t really say anything reasonable about the efficiency of a V1 piece of software except with respect to imagined possible changes in the future. Conversely, the efficiency of V2, V3, etc will be defined in relation to whatever version or versions preceded them. Now that we have the definition of utilization, though, we can talk about efficiency in terms of that. So, for example, if V2 uses half the systems that V1 used, then it’s twice as efficient (2x). Or V2 could require 20 machines while V1 required 10, in which case V2 is half as efficient.
Definition 2: Efficiency: the change in utilization between two versions of the system.
Once again — for all data center people out there, I’m oversimplifying for the sake of this particular argument.
In the paragraph quoted from the Times above there’s a bit of a jumble of terms. It starts talking about “energy efficiency” (which, in our definition, would be 100%-Utilization%) and then it talks about using “they were using only 6 percent to 12 percent” which is straight utilization%. I think playing fast and loose with terminology like that can get in the way of really knowing what we’re talking about, which is why I’ve spent some time defining, at least, what I’m talking about.
INSERT OBVIOUS TRANSITIONAL SECTION TITLE HERE
Ok, fine, the squirrel in charge of coming up with section titles is on a break and I can’t think of a good one, so let’s just keep going now that we’re armed with these terms. I’ve repeatedly stated that the average utilization, by itself, is meaningless. Allow me to elaborate on one key reason why. Suppose you measure ten systems and get the following utilization values: 10, 10, 10, 10, 10, 90, 90, 90, 90, 90. This gives you an average utilization of 50% — already something to note, since the average clearly isn’t telling the whole story. Assuming for a moment that these values are comparable across systems (a big if, but again: simplify!), the average is really hiding something important: five systems have “bad” utilization (10%) while the other five have “good” utilization (90%).
But wait, why am I using quotes around “good” and “bad”? Because this is something the article implies: that more utilization is better, but this is exactly why talking about efficiency matters. Maybe, just maybe, the five systems at 10% are actually systems with software that do the same thing but more efficiently, and from that perspective our assessment of good or bad can get inverted.
Suppose the five systems at 90% are part of a cluster. Suppose one of the five system crashes. Suddenly the load that went to that system has to be distributed to all the others and it quickly puts the remaining four systems at over 100% and the entire cluster could crash. The load would never get distributed perfectly across systems among other problems — in reality we’d probably looking at a cascade of failures as each individual remaining system crosses the 100% threshold, putting more load on others, and so forth, ending with the final state where the entire cluster is down and everyone, from the CEO to your users, is screaming bloody murder.
Suddenly we’re looking at those systems at 90% with some suspicion, no? In fact if you keep on the simplification vibe, you could argue that for a five-system cluster the only way to protect it from any one machine going down (assuming a crash at 100% load, also not a given…) is to maintain the average utilization at around 75% or so, which in the event of one system crashing would leave the remaining four at 93.75%. 80% average would mean one system crashing leaves everyone at 100%. So the difference between 75% and 80%, which seems minimal, is the difference between life and death in this scenario.
But I said can get inverted because there’s other factors at play. Take just one: uptime requirements. Suppose you’re somehow OK with the idea that one system crashing takes everything down, and you’re willing to trade uptime for costs (ie., using only five machines). Then you’d be ok, probably. Everyone I know, however, would want to protect from this eventuality.
This scenario isn’t contrived to get the answer to come up the way I want. It is typical, certainly far, far more common than a simple and straightforward DC setup components get swapped instantly, you’re always at maximum efficiency, and every system does the same thing. Reality isn’t like that. Speaking of which…
A (SMALL) DOSE OF REALITY
Let’s complicate things a bit further, or rather, make them just slightly more realistic. Let’s say that your 90% utilization was measured during a time where you had 100 simultaneous users on average. Tomorrow, though, Pando Daily posts a glowing review of your website and usage doubles. Doesn’t sound too far-fetched. Now what? The 90% utilization, if there’s a strong correlation between simultaneous users and load (which is typical), is suddenly guaranteed to bring you down. Suddenly you’d need much lower baseline utilization to be able to handle that spike, and the 90% looks like a bad idea once again.
Going a bit further, imagine the two clusters of five systems are actually performing the same function, but one of them is a V1 (90%) and the other one is the more efficient version you just deployed (10%). This is something that happens all the time. As you can instrument a piece of software running under real-world load, you can understand better what leads to that load, you can optimize your software, and sometimes massive jumps in efficiency are common. So when the 90% is deemed as a problem, the team gets to work, and they come up with a V2 that is around an order of magnitude more efficient.
Which leaves you with two possible versions of the software doing the same work, but one uses 10% of the systems and the other 90%. Which one is better? I think everyone would agree that the newer, more efficient system is better, even though if you deployed it, it would suddenly make your utilization plummets across the board, which, according to the article, is “bad”. Oops.
Hold on, I hear you say. Now that you’ve got software that is more efficient why don’t you just decommission the extra systems you don’t need? Then you’d be using less power overall and you could cut down, say, from 10 machines at 90% to 5 machines at 20%, which should give you a nice margin for error.
Aha! This surely sounds true, but this is also where the oversimplifications I keep bashing get tricky since real world gets in the way. Two interrelated points. First, there aren’t that many (if any) tidy switchovers from V1 to V2. In increasing efficiency you may have introduced bugs. To protect against that, you start having to test (more machines for that!), and even when testing says it’s ok to deploy you will start small — deploy to a few machines only, wait, verify. Deploy a bit more. Wait, verify.
THE MISSING VARIABLE: TIME
The process we just described has an important variable that we haven’t looked at so far: time. As in, in the process of doing this, time passes.
Sounds obvious right? But if this is true, it’s also true that as time passes, requirements change. Requirements, here, encapsulating all the factors, external and internal, that go into delivering your service or product. Whatever you’re doing, you’re not dealing with a static entity, but something that is constantly evolving, both because you’re changing it from inside as you update the software, fix bugs, evolve architecture, and deploy new hardware, but also because the external factors are constantly in flux. Likely, you will now have more people using the systems (or, for things that are just APIs, maybe more machines). Or load may have changed due to a feature.
The passage of time here is the critical element missing, and it gets in the way of the ideal scenario in which we allow the system to truly “contract” and use fewer resources in terms of power. By the time you’re sure you can decommission those extra machines, it’s quite possible that you now have other uses for them, and even if you did find a sliver of time to do this you may not have the option since system growth, or feature changes, or whatever, may be telling you that you will need those extra machines in a week or a month month, and therefore taking them down would be simply a waste of time. At Ning, for example, our overall hardware footprint remained largely stable in terms of number of machines, and therefore total power consumed, even as we went from 1 MM registered users to 50 MM and beyond. Setting aside the fact that this took an enormous amount of work and constant vigilance, it also meant that the system that could handle 50 MM users with the same amount of hardware as for 1MM was very different than the original. And throughout that process, many machines would oscillate in their degree of utilization, from the low teens to the high 70s. Over a period of a few years you could take measurements at different points in time that would either make us appear either like geniuses or criminally stupid — if all you looked at was utilization, and if “less utilization is bad” was your only guiding principle.
If we can agree that low or high utilization alone is meaningless, and that your utilization will fluctuate, maybe drastically, as you optimize, we can start to ask more appropriate questions. For example: can we do better in terms of releasing idle capacity as efficiency increases? Absolutely, and the widespread use of virtualization in recent years also means that it’s now far easier to have capacity fluctuate along with load. The APIs for system management popularized by public cloud infrastructure companies (e.g. EC2, Rackspace cloud, etc.) have led in the last couple of years to more and more services where capacity is instantiated on demand according to load, leading to a more efficient use of those virtualized resources.
Even there, though, we have a problem, for if EC2 allows you to instantiate and tear down a hundred AMIs without giving it a second thought, it’s also necessarily true that there must be an actual hardware footprint doing nothing but waiting around for that to happen. Having people constantly disconnect and reconnect machines is something that is not just not feasible, and that in almost every case will involve as much waste as taking a system down, since in the world we live in the capacity we need to run the ever-increasing complexity of Internet infrastructure keeps going up, which means that whatever you stop using today, you’re guaranteed to need tomorrow. With 60% or more of humans the planet still not online, there’s clearly still a lot of growth left. For the biggest services, it’s common to have to constantly be deploying new capacity just to account for growth.
This leads us to yet another question, perhaps the last one in this particular chain of thought: Should we in general accept less reliability from online services given that it has real environmental impact, and real cost?
My own answer to this would be a clear NO. I don’t think you can have these systems simultaneously be part of the fabric of society (a point I’ve made before) and have them be “partially reliable,” just like there’s no way to be “partially pregnant.” Reliability is intrinsically tied to the usefulness of these services. Perhaps there are ways in which we can bake in more asynchronous behavior in some cases, but when a lot of what systems do is real-time, 24/7/365 and worldwide, this isn’t something we’ll be able to exploit frequently. We have crossed the Rubicon, so to speak, and have to see this through.
THE CONCLUSION, OR, BELATEDLY MAKING SENSE OF THE TITLE OF THIS POST
Utilization in data centers is an important issue, but talking about it bereft of context is not really that useful. In particular, without also talking about efficiency, and all the parameters that go into it including what kinds of applications are running, what the goals are, what the requirements are, etc., is going to leave us with nothing but incomplete answers, and here incomplete will leave us way too close to incorrect for comfort.
And this isn’t just about the Internet. Is it a valid question to talk about utilization in, say, TV stations? Or other major source of media for that matter? Pre-digital music distribution…?. Sure it is, but it’s not something we focus on because the use of energy in other media is a one or two steps removed from what we see, so it’s easier to ignore even if it’s there all the same. When was the last time you remember a TV station couldn’t broadcast? Are we to believe that they never had a power failure? No. They have backup systems. Those evil-sounding lead batteries or diesel generators.
By looking at context and simply shifting the assumption that Internet infrastructure is actually run fairly well, given all the requirements and its rapid evolution, we realize that what we really should be wondering about what makes us build systems this way, rather assuming they are not built properly. Should we talk about utilization? Or is this really about what drives it, and therefore utilization is part of the discussion but not the central point?
Channeling Reverend Lovejoy for a moment, we could say, then: “Short answer yes with an if, long answer, no, with a but…”