diego's weblog

there and back again

a lot of lead bullets: a response to the new york times article on data center efficiency

Part 1 of a series (Part 2, Part 3)

Note: This is a 5,000-word post (!) in response to a 5,000-word article, since I thought it necessary to go beyond the usual “that’s wrong”. A detailed argument requires a detailed counter-argument, but, still, apologies for the length.

As I was reading this New York Times article on data centers and power use, there was mainly one word stretching, Doppler-like, through my head: “Nooooooo!”

Not because the article exposed some secret that everyone who’s worked on websites at scale already knows, with this intrepid reporter blowing the lid off our quasi-masonic-illuminati conspiracy. Not because there was information in it that was in any way shocking.

The reason I was yelling in my head was that I could see, clear as day, how people who don’t know what’s involved in running large scale websites would take this article. Just look at the comments section.

The assertions made in it essentially paint our engineers and operations people as a bunch of idiots who are putting together rows and rows of boxes in data centers without caring what this costs their businesses, nay, the planet.

And nothing could be further from the truth.

There is one thing that the article covers that is absolutely true: data centers consume a hell of a lot of power. Sadly, the rest is a mix of half-guesses, contradictions, and flat-out incorrect information that creates all the wrong impressions, misinforms, and misrepresents the efforts and challenges that the people running these systems face every day. In the process, the article manages to talk to precious few people who are really in a position to know and explain what’s going on. In fact, when I say precious few, I mean one: Jeff Rothschild, from Facebook. And instead of asking him questions about data centers, the article just uses one amusing anecdote from Facebook’s early days that makes eng/ops look like a bunch of monkeys running around with fans.

This isn’t just an incredibly inaccurate representation of the dedication and hard work of eng/ops everywhere in the computer industry; I know for a fact it’s also inaccurate with regard to Facebook itself. I imagine Facebook engineers (and those of any other website, really) reading this article and thinking about the times they’ve been woken up in the middle of the night to solve problems that no one has ever faced before, problems no one has trained them for, because no university course and no amount of research prepares you for the challenges of running a service at high scale. And they have to solve all of that as fast as possible, regardless of whether it’s about making sure that someone can run their business, do their taxes, or that a kid halfway around the world can upload their video of a cat playing the piano.

Before I continue, let me say that, even if I am (clearly) a bit miffed, I respect the efforts of the reporter, despite the inadequate sources the story quotes, and there is an important story here, but it’s not the one he focused on. The question is not whether there’s inefficiency, or rather, under-utilization of power. There is some, but not as much as claimed (the main figure the article quotes, that data centers, or DCs for short, run at 6-12% utilization, is simply made up by consultants, and I can’t begin to imagine how they arrived at it).

The question is why.

Sure, there’s some inefficiency. But why? There are many reasons, but before I get to them, let me spend some time on the problems in the article itself beyond this central issue.

The problems in the article

Let me go through the biggest issues in the article and debunk them. To begin:

Energy efficiency varies widely from company to company. But at the request of The Times, the consulting firm McKinsey & Company analyzed energy use by data centers and found that, on average, they were using only 6 percent to 12 percent of the electricity powering their servers to perform computations. The rest was essentially used to keep servers idling and ready in case of a surge in activity that could slow or crash their operations.

First off, an “average,” as any statistician will tell you, is a fairly meaningless number if you don’t include other measures of the population (starting with the standard deviation). Not to mention that this kind of “explosive” claim should be backed up with a description of how the study was made. The only thing mentioned about the methodology is that they “sampled about 20,000 servers in about 70 large data centers spanning the commercial gamut: drug companies, military contractors, banks, media companies and government agencies.” Here’s the thing: Google alone has more than a million servers. Facebook, too, probably. Amazon as well. They all do wildly different things with their servers, so extrapolating from “drug companies, military contractors, banks, media companies, and government agencies” to Google, or Facebook, or Amazon is simply not possible on the basis of 20,000 servers in 70 data centers.

Not possible, that’s right. It would have been impossible (and people who know me know that I don’t use this word lightly) for McKinsey & Co. to do even a remotely accurate analysis of data center usage for the industry, or to create any kind of meaningful “average”. Why? Because gathering this data and analyzing it would have required many of the top minds in data center scaling (and they are not working at McKinsey); because Google, Facebook, Amazon, and Apple would not have given McKinsey this information; and because the information, even if it had been handed over, would have been in wildly different scales and contexts, which is an important point.

Even if you get past all of these seemingly insurmountable problems through an act of sheer magic, you end up with another problem altogether: server power is not just about “performing computations”. To simplify a bit, there are at least four main axes you could consider for scaling: computation proper (e.g. adding 2+2), storage (e.g. saving “4” to disk, or reading it from disk), networking (e.g. sending the “4” from one computer to the next) and memory usage (e.g. storing the “4” in RAM). This is an over-simplification because today you could, for example, split up “storage” into “flash-based” and “magnetic” storage, since they are so different in their characteristics and power consumption, just like we separate RAM from persistent storage, but we’ll leave it at four. Anyway, these four parameters lead to different load profiles for different systems.

The load profile of a system can tell you its primary function in abstract terms. Machines in data centers are not used homogeneously. Some clusters may be used primarily for computation, others for storage, others have a mixed load. For example, a SQL database cluster will generally use all four axes heavily, while a cluster that serves as a memory cache will only use RAM and network heavily. As an aside, network matters for power consumption not in and of itself (running a network card is a rounding error in power terms) but because a big network infrastructure requires switches in heavily redundant configurations, which do count, not to mention that a bigger network means more complexity that has to be managed. But we’ll get to that in more detail later.
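
To make the load-profile idea concrete, here is a minimal sketch in Python with entirely hypothetical numbers (none of these figures come from any real cluster). The point is only that each type of cluster is busy on different axes, so no single number captures how “used” it is:

```python
from dataclasses import dataclass

@dataclass
class LoadProfile:
    """Approximate share of capacity used on each axis (0.0 to 1.0).
    All numbers below are made up for illustration."""
    cpu: float      # computation proper
    storage: float  # disk reads and writes
    network: float  # bytes in and out
    memory: float   # RAM footprint

# Hypothetical clusters: each is doing its job, but stresses different axes.
clusters = {
    "sql_database":  LoadProfile(cpu=0.7, storage=0.8, network=0.6, memory=0.7),
    "memory_cache":  LoadProfile(cpu=0.2, storage=0.0, network=0.8, memory=0.9),
    "batch_compute": LoadProfile(cpu=0.9, storage=0.3, network=0.2, memory=0.5),
}

for name, p in clusters.items():
    print(f"{name:13s} cpu={p.cpu:.0%} storage={p.storage:.0%} "
          f"network={p.network:.0%} memory={p.memory:.0%}")
```

A metric that only counts “computations” would call the cache cluster nearly idle, even though it’s doing exactly the job it was provisioned for.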

I don’t doubt that McKinsey got a lot of numbers from a lot of companies and then mashed them together to create an average. I’m just saying that a) it’s impossible to have the numbers measure the exact same thing across so many different companies that are not coordinating with each other, and b) the average is therefore of limited value, since the load profiles of all of these services are so wildly different that if you don’t have accurate data for, say, Google, your result is meaningless.
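
As a toy illustration of point b), here are two made-up server fleets that report the exact same “average utilization” while describing completely different realities:

```python
import statistics

fleet_a = [0.30, 0.30, 0.30, 0.30, 0.30, 0.30]  # steady, predictable load
fleet_b = [0.05, 0.05, 0.05, 0.05, 0.80, 0.80]  # mostly standby capacity plus a few hot machines

for name, fleet in (("fleet_a", fleet_a), ("fleet_b", fleet_b)):
    print(name,
          "mean:", round(statistics.mean(fleet), 2),
          "stdev:", round(statistics.stdev(fleet), 2))

# Both fleets average 0.30, but the number alone tells you nothing about
# which one could safely shed capacity.
```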

1. Right from the start, one of the primary legs on which the article stands is either incorrect, or meaningless, or both.

Moving on.

A server is a sort of bulked-up desktop computer, minus a screen and keyboard, that contains chips to process data.

No, that’s not what a server is. It is not a “bulked-up desktop”. In fact, the vast majority of servers these days are probably as powerful as a typical laptop, minus battery, display, graphics card and such. And in many cases the physical “server” doesn’t even exist since everyone doing web at scale makes extensive use of virtualization, either by virtualizing at the OS level and running multiple virtual machines (in which case, yes, perhaps that one machine is bigger than a desktop, but it runs several actual server processes in it) or distributing the processing and storage at a more fine-grained level (MapReduce would be an example of this). There’s no longer a 1-1 correlation between “server” and “machine,” and, increasingly, “servers” are being replaced by services.
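
For readers unfamiliar with MapReduce, here is a toy word count in Python that illustrates the general programming model (not any company’s actual implementation): the work is split into a “map” step that can run on many machines and a “reduce” step that merges the results, so there is no single box you could point to and call “the server”.

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    """Map step: count words in one shard of the input."""
    return Counter(chunk.split())

def reduce_counts(partial_counts):
    """Reduce step: merge the per-shard counts into one result."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    shards = [
        "the cat plays the piano",
        "a video of the cat",
        "piano cat piano cat",
    ]
    # Worker processes stand in for machines in a cluster.
    with Pool(processes=3) as pool:
        partial = pool.map(map_chunk, shards)
    print(reduce_counts(partial).most_common(3))
```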

The reason this seemingly minor thing matters is that it creates an image that a datacenter is readily comprehensible. Just a lot of “bulked up desktops.” I understand the allure of this analogy, but it’s not true, and it creates the perception that scaling up or down is just a matter of adding or removing these boxes. Sounds easy, right? Just use them more efficiently!

2. Servers are not “bulked up desktops,” and this is critical because infrastructure isn’t just a lot of bricks that you pile on top of each other. There’s no neat dividing line. You can’t just say “use fewer servers” to increase efficiency, which is what the incorrect analogy leads to.

Next!

“This is an industry dirty secret, and no one wants to be the first to say mea culpa,” said a senior industry executive who asked not to be identified to protect his company’s reputation. “If we were a manufacturing industry, we’d be out of business straightaway.”

Say what? That infrastructure is inefficient is a dirty secret? First off, inefficiency is not a secret, and it’s not “dirty.” Maybe Google or Facebook don’t publish papers talking about their utilization, but the distance between not shouting this from the rooftops and this being a “dirty secret” is indeed long.

This statement, from an anonymous source no less, matters because it creates the sense in the article that the industry is operating in the shadows, trying to hide a problem that, “if only people knew,” would create a huge issue. Nothing could be further from the truth. There are hundreds of conferences and gatherings a year, open to the public, where the people who run services get together to discuss these problems, solutions, and advances. Everyone I know talks about how to make things better and, without divulging company secrets, exchanges information on how to solve the issues we face. There are ACM and IEEE publications (to name just two organizations), papers, and conferences. The reason it appears hidden is that it’s just one of the many areas that only interest the people involved, primarily nerds. It’s arcane, perhaps, but not hidden.

3. This isn’t any kind of “industry dirty secret.” The statement only helps make this appear to be part of some conspiracy that doesn’t exist, and it papers over the real issues by shifting attention to the supposed people keeping the “dirty secret.” That is, it identifies a group of people at whom we can point our proverbial pitchforks, and nothing else.

Next!

Even running electricity at full throttle has not been enough to satisfy the industry. In addition to generators, most large data centers contain banks of huge, spinning flywheels or thousands of lead-acid batteries — many of them similar to automobile batteries — to power the computers in case of a grid failure as brief as a few hundredths of a second, an interruption that could crash the servers.

“It’s a waste,” said Dennis P. Symanski, a senior researcher at the Electric Power Research Institute, a nonprofit industry group. “It’s too many insurance policies.”

The first paragraph in this quote seems to imply that batteries (“lead-acid batteries” which sounds more ominous than just “batteries”) and “spinning flywheels” are there because the industry is “not satisfied with running electricity at full throttle”.

To begin — how do you run electricity at less than “full throttle”? Is there some kind of special law of physics that I’m not aware of that lets you use half an electron? If you need 2 amps at 110 volts, that’s what you need. If you need half, you use half. There’s no “full throttle.” A system draws the power it needs, no more, no less, to run its components at a particular point in time. Could you build machines that are more efficient in their use of power? Of course, and people are working on that constantly. Google, Facebook, Amazon, all spend hundreds of millions of dollars a year running data centers, and a big chunk of that is power. It’s a cost center that people are always trying to reduce.
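
A trivial back-of-the-envelope example of that point, with illustrative numbers that don’t describe any real server: power drawn is just voltage times current, and the current follows the load.

```python
# Illustrative only: P = V * I. A machine draws what its load requires.
volts = 110
amps_light_load = 1.0   # hypothetical draw when lightly loaded
amps_heavy_load = 2.0   # hypothetical draw under heavy load

print("light load:", volts * amps_light_load, "W")  # 110.0 W
print("heavy load:", volts * amps_heavy_load, "W")  # 220.0 W
# There is no "full throttle" switch; the draw tracks what the components need.
```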

Then there’s the money quote from Mr. Symanski, someone I’ve never heard of, saying that “It’s a waste. It’s too many insurance policies.”

Really? Let’s look at what happens when a data center goes offline. How many the-sky-is-falling articles have been written when significant services of the Internet are affected? A ton, of course, many of them in The New York Times itself. Too many insurance policies? Not true. Managing these systems is incredibly complex. Eng/ops people don’t deploy systems for the fun of it. We do this because it’s required, and, if anything, we do less than we know we should because there’s never enough money or people or time to deploy the right solution, so we make do with what we can and make up for the difference with a lot of hard work.

4. It’s not “too many insurance policies.” Redundant systems in data centers aren’t perfect by any stretch of the imagination, but the article strongly implies that this redundancy is deployed willfully, flying in the face of evidence that it’s unnecessary. This is flat-out not true.

These next two paragraphs are just incredible in how they distort information:

A few companies say they are using extensively re-engineered software and cooling systems to decrease wasted power. Among them are Facebook and Google, which also have redesigned their hardware. Still, according to recent disclosures, Google’s data centers consume nearly 300 million watts and Facebook’s about 60 million watts.

It quotes essentially unnamed sources at Facebook and Google (or their PR people) as saying that they are using software and cooling systems to decrease wasted power, and then goes on to say that they still consume a lot of power, creating the impression that they are not really solving the problem. Here’s the thing, though: if you built a more efficient rocket to go to the moon, it would still consume a lot of power. No one would be shocked to learn that, would they?

But there’s more!

Many of these solutions are readily available, but in a risk-averse industry, most companies have been reluctant to make wholesale change, according to industry experts.

This is probably the one paragraph in the article that sent my head spinning. It makes it appear as if there are “solutions” that everyone knows about but no one uses because we’re a “risk-averse” industry.

What “solutions”? Who is “risk-averse”? Which companies are “reluctant to make wholesale change”? Who are the “industry experts” that assert this is the case?

This last paragraph is, quite simply, flat-out factually wrong. There are no “solutions” that are “readily available.” There simply aren’t. The article later quotes some efforts around higher utilization, from companies working in the area, implying that if everyone just did whatever these people are doing, everything would be better. And I say “whatever these people are doing” to capture the flavor of the article, since what they are doing is not magic; they just seem to be more efficient at queuing and batching processing. The problem is that not everything can be efficiently batched, certainly not when your usage isn’t dictated by your own schedule but by end users around the world, and each particular use case requires its own solutions. We’re not dealing with one problem but many, but I’ll get to that in a moment.
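
To show what “queuing and batching” means in practice, here is a generic sketch (not what any of the quoted companies actually do): background jobs can be queued and processed in bulk so machines run hot in bursts, but a user-facing request can’t be deferred this way because someone is waiting on the other end.

```python
import time
from queue import Queue, Empty

def process_batch(batch):
    # Hypothetical bulk-processing step (e.g. writing analytics rows in one go).
    print(f"processing {len(batch)} queued jobs at once")

def batch_worker(jobs: Queue, batch_size=100, max_wait_s=1.0):
    """Drain background jobs in batches to improve utilization.
    Interactive traffic can't be treated this way: users won't wait."""
    while True:
        batch = []
        deadline = time.monotonic() + max_wait_s
        while len(batch) < batch_size and time.monotonic() < deadline:
            try:
                batch.append(jobs.get(timeout=0.05))
            except Empty:
                pass
        if batch:
            process_batch(batch)
```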

5. There are no “readily available” solutions to this problem, because there isn’t just one problem to solve. There are multiple overlapping challenges, all different, often pulling in contradictory directions (e.g. trading utilization for failover capability), and no one is “reluctant to make wholesale change”: these are huge, complex systems that can’t just be swapped out for something else.

(And by the way, how ironic is it that an article about “an industry reluctant to make wholesale change” is being run by… a newspaper?)

Finally, just to wrap up, since I’ve gone on way longer than I planned on the quotes and I don’t want to reprint the whole article, I want to touch on the contradictions. One giant contradiction is that the article talks throughout about how the “industry” (which, by the way, is never defined clearly, although I’m assuming we’re talking about the computer industry in general, since by now everyone pretty much uses data centers) is “risk averse” while also covering the sheer scale of the infrastructure that exists. Most of this infrastructure has been built up over the last ten years. Google, Facebook, Amazon, and everyone else have ramped up their operations by orders of magnitude over the last five years. Most of these companies didn’t even exist 15 years ago! The social web and mobile have exploded in the last three years. There’s simply no way the industry can simultaneously build up this massive infrastructure, sustaining exponential growth rates in traffic and usage, and be so hugely risk averse. Maybe that will be true in a couple of decades. It isn’t true now, and whatever “expert” the reporter talked to has not really been involved in running a modern data center. A lot of quotes in the article seem to come from people at “The Uptime Institute,” an organization I’ve never heard of, so maybe for a follow-up they should talk to the people who are actually running these systems at Facebook, Google, and others.

Nationwide, data centers used about 76 billion kilowatt-hours in 2010, or roughly 2 percent of all electricity used in the country that year, based on an analysis by Jonathan G. Koomey, a research fellow at Stanford University who has been studying data center energy use for more than a decade. DatacenterDynamics, a London-based firm, derived similar figures.

The industry has long argued that computerizing business transactions and everyday tasks like banking and reading library books has the net effect of saving energy and resources. But the paper industry, which some predicted would be replaced by the computer age, consumed 67 billion kilowatt-hours from the grid in 2010, according to Census Bureau figures reviewed by the Electric Power Research Institute for The Times.

This represents another contradiction, one that spans the whole article: there are many mentions of how it’s impossible to accurately measure how much power is being used, and there are just as many specific numbers thrown around for the power being used. Which is it? Do we know, or don’t we?

But the clincher here is that the “industry” which “has long argued that computerizing business transactions and everyday tasks like banking and reading library books has the net effect of saving energy and resources” used, apparently, 76 billion kilowatt-hours in 2010, “but” the paper industry used 67 billion kilowatt-hours in 2010. I’m sorry, what? The paper industry? Are we counting Post-Its here? Kitchen paper towels? And assuming these figures are accurate, in what universe can we compare the power used by data centers as a whole with the power used by paper-making companies and not only pretend that there’s any equivalence, but also imply that the computer industry has failed because the paper industry is still huge? And that’s the implication created by the paragraph, no question about it.

The real issue: why?

So as I said at the beginning, the article goes off in what I can only characterize as sensationalistic attacks on data center technology while avoiding the real question: why?

Why do data centers consume so much power? Why are there multiple redundancies? Why is it that utilization is not 100%?

Power consumption

First — data centers consume a lot of power for the simple reason that we’re doing a lot with them. The article touches upon this very briefly. Massive amounts of data are flowing through these systems, and not just from consumers. The data has to be stored, processed, and transmitted, which requires extremely large footprints and therefore huge power consumption. It’s simple physics.

Contrary to what the article states, data centers have undergone drastic evolution in the last ten years, and continue to do so. There’s an incredible amount of work being done to make data centers and infrastructure work better, be more efficient, cheaper, faster, more reliable, you name it. However, this constant evolution itself leads to some inefficiency. You can’t replace everything at once. New systems and approaches have to be tested and re-tested, then deployed incrementally. There’s no silver bullet.

There’s also the issue of complexity, evolution of requirements, and differences of usage, all of which also lead to inefficiency, which I’ll come back to in the end.

Redundancies

The article strongly implies, more than once, that unnecessary redundancies, fear of failure, and so forth are among the key reasons for inefficiency. This is, as I’ve said above, completely untrue. The redundancies that web services have today are not excessive. They are the best way each company has found to solve the challenges it has faced and continues to face. No doubt if you took any one system within any one company and did a deep analysis you could find elements to optimize, but that doesn’t mean those optimizations would be feasible in practice. When you deal with complex systems, unintended consequences are lurking at every turn, and the obvious way to avoid a catastrophic failure is to have backup systems. Airplanes have multiple redundant systems for exactly this reason.

And before you say that comparing an airplane with web services is a bad analogy, imagine the type of disruption you’d have, worldwide, if Google was down for a day. Or if hundreds of millions of people couldn’t access their email for a day. The planet freaks out when Gmail or Twitter is down even for a couple of hours! Even services that are less obviously critical, like Facebook or Twitter, would generate huge disruptions if they disappeared. Dictators that are constantly trying to shut down access to these services know why: these services, frivolous as they sometimes appear, are part of the fabric of society now and crucial in how it functions and how it communicates. That aside, there’s of course the minor matter that they are run by for-profit companies, and if the service isn’t running they aren’t making any money.

Speaking of money, that’s an important point lost in the article. People don’t just, as a rule, throw tens of millions of dollars at a problem when they can avoid it. You can count on the fact that if there was a way to provide the same reliability with significantly less money, they would do it. A natural thing to do would have been to go to the CFO of any of these companies and ask: “look, I just uncovered this massive inefficiency and waste, why are you wasting money like this?” But that would go against the narrative that these companies are just gripped by fear of change and either they don’t care that they are burning through hundreds of millions of dollars for no reason, or they are so stupid that they don’t even realize it.

Over-capacity

Third — data center utilization is not 100%. This is true. But, back to airplanes, it’s also true that on a typical flight there are empty seats, because you need overcapacity. Similarly, no one runs a large web service without at least 25% spare capacity, and in some cases you need more. Why? Many reasons, but, first, there are spikes. Usage of services on the web (whether directly through a website, through cellphone apps, or even backend services) is often hard to predict, and in many cases, even when predictable, it is incredibly variable. There are spikes related to events (e.g. the Olympics) and micro-spikes within those spikes that push the system even further. To use a recent example, everyone talked about how, at the conventions a few weeks ago, Mitt Romney’s acceptance speech generated something like 15,000 tweets/second while Obama’s speech peaked at around 45,000 tweets/second. What no one is asking is how it is possible that the same system that handled 15,000 tweets/second could also handle 45,000 only a few days later. Twitter didn’t just triple its capacity for that speech only to tear it down the next day, right? The answer is simple: overcapacity, planned and deployed well in advance. And if it didn’t have the capacity ready, what would have happened? Would we have seen an article congratulating Twitter for saving power and not running a capacity surplus? Or would the web have exploded with “Twitter goes down during convention” commentary? It’s not hard to guess.
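
Here is a back-of-the-envelope version of that trade-off, using the tweets-per-second figures above plus a hypothetical per-server throughput and safety margin (both made up for illustration):

```python
import math

baseline_tps   = 15_000   # tweets/second during the first speech (from above)
spike_tps      = 45_000   # tweets/second a few days later (from above)
per_server_tps = 500      # hypothetical throughput of a single server
safety_margin  = 1.25     # hypothetical headroom for failures and retries

servers = math.ceil(spike_tps * safety_margin / per_server_tps)
baseline_utilization = baseline_tps / (servers * per_server_tps)

print("servers provisioned:", servers)                                 # 113
print(f"utilization at an ordinary peak: {baseline_utilization:.0%}")  # ~27%
# A fleet sized to survive the spike inevitably looks "underutilized"
# the rest of the time.
```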

Another cause for over-provisioning is quite simply fast growth. These systems cost millions of dollars to buy, deploy, test and launch. This is not the kind of thing you can do every day. So you plan the best you can and deploy in tranches, to make sure that you have enough for the next step in growth, which means that at any point in time you are running with more capacity than you need, simply because even though it’s more capacity than you need today it’s going to be less than what you need tomorrow.
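
And a similarly rough sketch of the growth problem, with made-up numbers: if demand doubles roughly every year and capacity can only be added in discrete tranches bought ahead of need, utilization necessarily sits well below 100% most of the time.

```python
demand = 1_000.0   # requests/second today (made up)
capacity = 0.0

for month in range(12):
    if capacity < demand * 1.5:     # buy the next tranche before running out
        capacity = demand * 2.0     # sized for roughly six months of growth
    print(f"month {month:2d}: demand={demand:7.0f} rps  "
          f"capacity={capacity:7.0f} rps  utilization={demand / capacity:.0%}")
    demand *= 2 ** (1 / 12)         # demand doubles about once a year
```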

There are also bugs. Code that is deployed sometimes isn’t hyper-efficient, or doesn’t work exactly as intended (what? shocking!). When that happens, having extra capacity helps prevent a meltdown.

In the process of deploying, you need to test. Testing requires creating duplicates of a lot of infrastructure so that you can verify that the site won’t just stop running when you release a new feature. Testing environments are far smaller than production environments, but they can still be sizable.

Another huge area that requires fluctuating (and yet, ever-increasing) capacity is analytics. You need to make sense of all of the data, be able to know when to run ads, and figure out where the bottlenecks are in the system so you can optimize it. That’s right, before you can optimize any system you need to first make it bigger by creating an analytics infrastructure so that you can figure out how to make it smaller.

Then there are attacks on infrastructure that have to be survived, since you can’t prevent all of them. The article makes absolutely no mention of this, but before you can detect that an attack is happening, you have to be able to withstand it. So any half-decent web infrastructure has to be able to handle an attack before it can be neutralized.

Being ready for spikes and growth, testing and deployment, data analysis, etc, all of this requires overcapacity.

If the article was talking about the human immune system, it would have said something like “look at all of those white cells in the body, doing nothing most of the time, what a waste.” But the truth is that they’re there for a reason.

The common thread: complexity

One final point I said I’d get to a couple of times is complexity. It’s the common thread among all of these reasons, and it doesn’t make for nice soundbites. From the issue of load profiles that I touched on at the beginning, all the way to the problems created by constant (and sometimes extremely fast) growth at the end, we are facing challenges without precedent, and the solutions are often imperfect.

On top of that, requirements change quickly, today’s popular feature is tomorrow’s unused piece of code, and even within what’s popular you can have usage spikes that are impossible to plan for and that therefore you have to solve on the fly, leading to less-than-perfect solutions.

There are no “well known solutions” because each problem is unique. Even within the same domain (say, social networks) you have a multitude of scaling challenges. Scaling profile pages is drastically different from scaling a page that contains a forum. Even for what superficially appears to be the same challenge (e.g. profile pages), each company has different features and different approaches, which means that, say, Google’s solution for scaling Google+ profiles has very little in common with Facebook’s solution for scaling theirs. Even when the functionality is similar, there are a multitude of factors, such as the business model, that drive what parameters you need to scale for in each case. There’s simply no one-size-fits-all solution.

The people working to build our digital infrastructure are extremely talented, work extremely hard, and are facing problems that no one has faced before. This is not an excuse, it’s just a measure of the challenge, which we take on gladly. Sure, it’s not perfect, but it’s what we can do. It’s humans, flaws and all, running these systems. The alarmist and sensationalist tone of the New York Times article, coupled with the insinuation that the solution exists but no one wants to use it, does everyone a disservice. Solving these challenges requires continued work, incremental improvements, and a lot of focus.

Or, as Ben Horowitz quoted (in one of my favorite quotes of all time) Bill Turpin as saying during their days at Netscape: “There is no silver bullet that’s going to fix that. No, we are going to have to use a lot of lead bullets.”

Amen, brother.

Part 1 of a series (Part 2, Part 3)

