diego's weblog

there and back again

Category Archives: software

maybe because both words end with “y”

In an apparent confusion between the word “utility” and the word “monopoly,” the Wall Street Journal runs an opinion piece today called “The Department of the Internet” that has to be one of the most disingenuous (and incoherent) efforts to attack Net Neutrality I’ve seen in recent times. The author, currently a hedge fund manager and previously at Bell Labs/AT&T, basically explains all of the ways in which AT&T slowed down innovation, either by omission, errors of judgment, or willful blocking of disruptive technologies.

All of them because, presumably, AT&T was classified as a “utility.” I say “presumably” because at no point does the piece establish a clear causal link between AT&T’s service being a utility and the corporate behavior he describes.

Thing is, AT&T behaved like that primarily because it was a monopoly.

And how do we know that it was its monopoly power that was the primary factor? Because phone companies never really stopped being regulated in the same way — and yet competition increased after the breakup of AT&T. In fact, you could argue that regulation on the phone system as a whole increased as a result of the breakup.

Additionally, it was regulation that forced companies to share resources they otherwise would never have. In fact the example of “competition” in the piece is exactly an example of government intervention similar to what Net Neutrality would do:

“The beauty of competition is that you get network neutrality for free. AT&T cut long-distance rates in the 1980s when MCI and Sprint started competing fiercely.”

Had the government not intervened on multiple occasions (whether in the form of legislation, the Courts, or the FCC, and most dramatically with the breakup), AT&T would never have allowed third parties to sell long distance to their customers, much less at lower rates than them.

There’s more than one fallacy in the piece on how “utilities are bad”:

A boss at Bell Labs in those days explained what he called the Big Lie, using water utilities as an example. Delivering water involves mostly fixed costs. So every decade or so, water companies engineer a shortage. Less water over the same infrastructure meant that they needed to raise rates per gallon to generate returns. When the shortage ends, they spend the extra money coming in on fancy facilities, thus locking in the higher rates for another decade.

So — someone, decades ago, gave an example of the corruption of water companies to the author, and regardless of whether this “example” is true or not, real, embellished or a complete fabrication, and regardless of whether the situation is, I don’t know, maybe a little different half a century later and dealing with bits and not water molecules, it’s apparently something good to throw out there anyway. (In fact, I struggle to see exactly what AT&T could do that would be analogous to the abuse he’s describing).

Again, this is presumed, since no causal link is established in the sense that if true, the described ‘bad behavior’ is conclusively the result of something being a utility rather than, well, any other reason, like corruption, incompetence, or just greed.

To close: I’ve seen that a number of people/organizations (many but not all of them conservatives) are opposed to Net Neutrality. My understanding is that this is because of fear of over-regulation. Fair enough. Have any of them thought about how it would affect them? Perhaps it’s only when it’s implemented that they will realize that their readers/customers, by an overwhelming majority, have little choice of ISPs. Very few markets have more than two choices, and almost no markets have competitive choices (i.e., choices that are at equivalent levels of speed or service).

But I’m sure that the Wall Street Journal, or Drudge, or whoever will be happy to pay an extra fee to every IP carrier out there so their pages and videos load fast enough and they don’t lose readers.

Right?

what a startup feels like (sometimes)

That is all.

the apple developer center downpocalypse

We’re now into day three of the Apple Developer Center being down. This is one of those instances in which Apple’s tendency to “let products speak for themselves,” an approach that ordinarily has a lot going for it, can be counterproductive. In three days we’ve gone from “Downtime, I wonder what they’ll upgrade,” to “Still down, I wonder what’s going on?” to “Still down, something bad is definitely going on.”

Which, btw, is the most likely scenario at this point. If you’ve ever been involved in 24/7 website operations you can picture what life must have been like since Thursday for dozens, maybe hundreds of people at Apple: no sleep, constant calls, writing updates to be passed along the chain, increasingly urgent requests from management wanting to know, exactly, how whatever got screwed up got screwed up, and all of that competing with the much more immediate problem of actually solving the issue.

And a few people in particular, likely less than a dozen, are under particular pressure. I’m not talking about management (although they have pressure of their own) but the few sysadmins, devops, architects and engineers that are at the center of whatever team is responsible for solving the problem, which undoubtedly was also in charge of the actual maintenance that led to the outage in the first place, so the pressure is multiplied.

Even for global operations at massive scale, this is what it usually comes down to — a few people. They’re on the front lines, and hopefully they know that some of us appreciate their efforts and that of the teams working non-stop to solve the problem. I know I do.

The significance of the dev center is hard to see for non-developers, but it’s real and this incident will likely have ripple effects beyond the point of resolution. Days without being able to upload device IDs, or create development profiles. Schedules gone awry. Releases delayed. People will re-evaluate their own contingency plans and maybe question their app store strategy. Thousands of developers are being affected, and ultimately, this will affect Apple’s bottom line.

And that’s why this situation is not the kind of thing that you’ll let go on for this long unless there was a very, very good reason (only a couple of days from reporting quarterly results, no less). Maybe critical data was lost and they’re trying to rebuild it (what if everyone’s App IDs just went up in smoke?). Maybe it was a security breach (what if the root certs were compromised?). The likelihood that there will be consequences for developers, as opposed to just a return to the status quo, goes up with every hour that this continues. As Marco said: “[…]  if you’re an iOS or Mac App Store developer, I’d suggest leaving some free time in the schedule this week until we know what happened to the Developer Center.”

In fact, it could be that at least part of the delay has to do with coming up with procedures and documentation, if not a full-on PR strategy. Apple hasn’t traditionally behaved this way, but Tim Cook has managed things very differently than Steve Jobs in this regard.

Finally, I’ve been somewhat surprised by the lack of actual reporting on this. One day, maybe two days… but three? Nothing much aside from minor posts on a few websites, and not even much on the Apple-dedicated sites. This is where real reporting is necessary. Having sources that can speak to you about what’s going on. Part of the problem is that the eventual impact of this will be subtle, and modern media doesn’t do subtle very well. It’s less about the immediate impact or people out of a job than about a potential gap in future app releases. A whole industry is in fact dependent on what goes on with that little-known service, and with iOS 7/Mavericks being under NDA, Apple’s developer forums, which are also down, are the only place where you can discuss problems and file bug reports. Some developer, somewhere, is no doubt blocked from being able to do any work at all. 

Apple should, perhaps against its own instincts, try their best to explain what happened and how they’ve dealt with it. Otherwise, the feeling that this will just happen again will be hard to shake off for a lot of people. For Apple, this could be an opportunity to engage with their developer community more directly. Here’s hoping.

diego’s life lessons, part III

Excerpted from the upcoming book: “Diego’s life lessons: 99 tips for survival, fun, and profit in today’s baffling bric-a-brac world.” (see Part I and Part II).

#9 make the right career choices

Everyone will have seven careers in their lifetime, someone said once, and we all repeated it even if we have no idea why.

The key to career planning, though, is to keep in mind that while the world of today ranges from complicated to downright baffling, the world of tomorrow will be pretty predictable, since as we all know it will just be a barren hellscape populated by Zombies.

So the question is: post-Zombie Apocalypse, what will you need to be? Survival in the new Zombie-infested world will require the skills of any good D&D party: a Healer, a Warrior, a Thief, and a Wizard — which in a world without magic means someone to tinker with things, build weapons, design shelters with complicated spring traps, and knowledge of how to brew a good cup of coffee.

Clearly you don’t want to be a Healer (read: medic/doctor), since that means no one will be able to fix you — you should have friends or relatives with careers in medicine, however, for obvious reasons. Being a Thief will be of limited use, but more importantly it’s not really the kind of thing you can practice for without turning to a life of crime as defined by our pre-Zombie civilization (post-Zombies, most of the things we consider crimes today will become fairly acceptable somehow, so you may be able to pull this off with the right timing).

That leaves you with either Warrior or Wizard, which translates roughly to: Gun Nut or Hacker. And by “Hacker” we mean the early-1980s definition of hacker, rather than the bastardized 2000s version, and one that is not restricted to computers.

So. Your choices for a new career path are as follows:

  • If you’re a Nerd, become a Hacker.
  • If you’re neither a Nerd nor a Hacker, just become a Gun Nut; it’s the easiest and fastest way to post-apocalyptic survival. This way, while you wait for Zombies to strike you won’t need to worry (for example) about a lookup being O(N) or not, or why the CPU on some random server is pegged at 99% without any incoming requests.
  • If you’re already a Gun Nut, you’re good to go. Just keep buying ammo.
  • If you’re already a Hacker… please don’t turn into an evil genius and destroy the world. Try taking up some activity that will consume your time for no reason, like playing The Elder Scrolls V: Skyrim or learning to program for Blackberry.

NOTE (I): If you’re in the medical profession, just stay put. We will protect you so you can fix our sprained ankles and such.
NOTE (II): there is also the rare combination of Hacker/Nerd+Gun Nut, but you should be aware that this is a highly volatile combination of skills which can have unpredictable results on your psyche.

#45: purchase a small island in the Pacific Ocean

As far as having a permanent vacation spot, this one really is a no-brainer. Why bother with hotels when you can own a piece of slowly sinking real estate? Plus, according to highly reliable sources, you don’t need to be a billionaire.

True, you will have significant coconut-maintenance fees and you’ll probably need a small fleet of Roombas to keep the place tidy, but coconuts are delicious and the Roombas can help in following lesson #18.

NOTE I: don’t be fooled by the “Pacific” part of “Pacific Ocean.” There’s nothing “pacific” about it. There’s storms, cyclones, tsunamis, giant garbage monsters, sharks, jellyfish, and any number of other dangers. Therefore, an important follow-up to purchasing the island is to buy an airline for it. You know, to be able to get away quickly, just in case.

NOTE II: this is actually an alternative to the career choices described above, since it is well known that Zombies can’t swim.

NOTE III: the island should not be named Krakatoa (see lesson #1). Aside from this detail, owning a Pacific Island does not directly conflict with lesson #1, since the cupboard can actually be located in a hut somewhere on the island (multiple cupboard hiding spots are also advisable).

#86 Stock up on Kryptonite

Ok, so let me tell you about this guy… He wears a cape and tights. He frequently disrobes in public places. He makes a living writing for a newspaper with an owner that makes Rupert Murdoch look like Edward R. Murrow. He has deep psychological scars since he is the last survivor of a cataclysmic event that destroyed his civilization. He leads a secret double life, generally disappearing whenever something terrible happens. He is an illegal alien. Also, he is an ALIEN.

Does this look like someone trustworthy to you? Hm?

That’s right. This is not a stable person.

Add to the list that he can fly, even in space, stop bullets, has X-ray vision, can (possibly) travel back in time and is essentially indestructible. How is this guy not a threat to all of humanity?

Lex Luthor was deeply misunderstood — he could see all this, but his messaging was way off. Plus there were all those schemes to Take Over The World, which should really be left to experts like genetically engineered mice.

The only solution to this menace is to keep your own personal stash of Kryptonite. Keep most of it in a cupboard (see lesson #1) and a small amount on your person at all times.

After all, you never know when this madman will show up.

the reason behind windows phone’s dominance in some geographies

Via Daring Fireball, Nick Wingfield points to places in the world where Windows Phone is outselling iPhone. Gruber notes, correctly, that these are not Apple strongholds. Blackberry is also extremely popular in those geographies.

What is special about those places? Is it that they have some cultural quirk that prevents them from appreciating iOS?

No. It’s about exchange rates and import controls.

Imports to Argentina, for example, are effectively frozen. People can’t get all sorts of things, from books to electronics. Simple kitchen appliances are in some cases hard to come by. Anecdotally, I can say with some degree of certainty that people would love to get Apple products, and yet Apple products are in extremely short supply since the government denies import licenses unless you export the same amount. Car companies export grains so they can bring in cars. RIM set up a factory in the country just so they could sell phones (you can imagine Apple, given its size and scale, didn’t bother).

For reference, see this Businessweek article:

After months of negotiations, [BMW] figured out a fix. The government agreed to let in BMW’s vehicles as long as the company’s Argentine subsidiary exported an equivalent amount of upholstery leather, car parts … processed rice. Echeagaray worked a deal with the Ministry of Industry to get the necessary import permits.

Russia and India are not exactly the same story but match shades of it. The exchange rate factor is a big issue too (more so in Russia and India than in Argentina): the cost of Apple products translates more directly into dollar terms, since they are manufactured in a few locations worldwide and then priced in dollars, as opposed to being manufactured locally and priced in local currencies. This makes them expensive. No doubt Apple is making a conscious decision here to avoid devaluing their products in real terms.

assume good intentions

A good friend once told me: “Assume good intentions.” Those three words have been hugely influential in my world view in the last few years. Once you make this idea explicit it can shape how you think about what others do in significant ways.

I was reading today about some of the brouhaha surrounding Lean In and the whole why-is-a-billionaire-woman-telling-women-everywhere-what-to-do thing and there was a reference to the launch of Circles.

Gina & Team: congratulations on the launch, it must have been a crazy effort and it looks great.

It seems it’s been building up for a while (the controversy around the book, that is) but I had not seen it until today when I read this article in The New Yorker.

Why I bring this up is that what keeps coming back to me in all of this is how our perspective in the Valley is sometimes clouded by second-hand opinions, innuendo, and gossip, for example around who got funded by whom or which idea is “in”. Yes, this is not unique to the Valley, but it happens frequently here and so I can attest to it, in my own backyard (so to speak… the actual inhabitants of my shared backyard are bluebirds and squirrels).

Putting yourself out there, through a book, art, or even, yes, software, is a hard thing to do. People misunderstand and misinterpret your intentions and motivations constantly, and the schadenfreude that is sadly all-too-common makes things even harder. But we are all just people, trying to do the best we can. The number of significant zeros in your bank account doesn’t change that in most cases. And I say that having very few significant zeros left in my own bank account.

But, funny thing (not ha-ha funny), most of the people that have such strong opinions on these things have never done them. They “talk about the book” without having “read the book.” (You really need to read The New Yorker article to get this reference). Some of my brothers-in-arms work at Evernote, but do they get press and coverage when they “just” keep an awesome service/app running? No. They get press when someone breaks into their systems.

Controversy sells.

Don’t get me wrong: critics are good. But it’s a matter of degrees. I’m not saying you need to write a book to be able to critique a book, or that you need to start a company to be able to give your opinion on how it should be run, but at the very least spend a moment and consider the effort involved. Avoid ad hominems. Forget about money for a second. Consider how much of their lives these people are sacrificing trying to do something.

Assume good intentions.

I bet that if you did that you’d find yourself a bit more forgiving of missteps, a bit more understanding, a bit more willing to believe.

And for those who are doing it, regardless of the scope or (apparent) size of your project, here’s something I could not say out loud because it would sound terrible given my accent… but I can write it: Gina, Sheryl, and all of you out there who are putting yourselves, your sanity, on the line for an idea: Give ‘em hell.

:-)

short answer yes with an if, long answer, no, with a but…

Part 3 of a series (Part 1, Part 2)

HERE WE GO AGAIN

I will look at this from one more angle and then I will let it rest here for future reference, since pretty much everyone else seems, not surprisingly, to have moved on. With the aside on how we go about discussing this topic out of the way (and various other digressions) in my post last Sunday, I wanted to focus a bit on what is perhaps the center of the argument used in the Times article. At the very least, elaborating on the flaws at the center should get us very close to exposing the feebleness of the rest of the argument’s construction.

At the core of the argument is the following paragraph:

Energy efficiency varies widely from company to company. But at the request of The Times, the consulting firm McKinsey & Company analyzed energy use by data centers and found that, on average, they were using only 6 percent to 12 percent of the electricity powering their servers to perform computations. The rest was essentially used to keep servers idling and ready in case of a surge in activity that could slow or crash their operations.

In my response I took issue with this paragraph in two specific areas: 1) that an average is meaningless without more information (e.g. the standard deviation, for starters) and 2) that the measure of utilization they imply, and I say imply because they never make it clear –another flaw–, was one of “performing computations.” I elaborated a bit on the various types of tasks servers may be performing and noted that you couldn’t amalgamate them all into a single value. This is true, but I want to step back a bit into what utilization means and the unmentioned factor that is on the other side of it: efficiency.

WHERE IT BECOMES CLEAR THAT TERMINOLOGY MATTERS (A LOT)

Semantics time: we need to define some terms. Let’s say, just for a moment, to simplify the discussion a bit, that we’re ok talking about utilization as some kind of aggregate. Let’s say that we ignore the specifics of what the servers are doing and we further assume that we will use a percentage of utilization of the “system,” to some percent between 0 and 100, with zero being the system is doing nothing but its own minimal housekeeping tasks, so zero isn’t really zero, but we’re simplifying, and 100 being the system is fully utilized doing something specifically related to the application at hand. I should start though, with some terminology housekeeping by defining what “a system” is.

Definition 0: I will use the words system and machine interchangeably to mean a particular piece of hardware or virtual instance, typically a server, running a particular piece of software. Mixing up virtual systems and machines with actual hardware is a bit of a shortcut, but since in practice virtual systems must eventually map to a real one, it’s one that I think we can live with. (Another shortcut lies in “typically a server” since, say, network switches should also be part of the equation.)

Definition 1: Utilization: a percentage between 0 and 100 of system load related to the specific application tasks at hand.

I shudder at the oversimplification, but I’ll get over it. Probably.

Now, related to utilization is efficiency. While utilization can be said to be an objective concept that is measurable, efficiency can only really be understood relative to something else, and in the case of a piece of software, relative to previous versions of itself. So for example we can’t really say anything reasonable about the efficiency of a V1 piece of software except with respect to imagined possible changes in the future. Conversely, the efficiency of V2, V3, etc. will be defined in relation to whatever version or versions preceded them. Now that we have the definition of utilization, though, we can talk about efficiency in terms of that. So, for example, if V2 uses half the systems that V1 used, then it’s twice as efficient (2x). Or V2 could require 20 machines while V1 required 10, in which case V2 is half as efficient.

Definition 2: Efficiency: the change in utilization between two versions of the system.
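As a rough illustration of Definition 2, here is a minimal sketch that expresses efficiency as a ratio between two versions. It assumes both versions handle the same workload at comparable per-machine utilization, so machine counts can stand in for utilization; the function name `relative_efficiency` is made up for this example.

```python
def relative_efficiency(machines_before: int, machines_after: int) -> float:
    """Efficiency of the new version relative to the old one.

    Assumes the same workload and comparable per-machine utilization,
    so fewer machines means proportionally higher efficiency.
    """
    if machines_before <= 0 or machines_after <= 0:
        raise ValueError("machine counts must be positive")
    return machines_before / machines_after


# V2 uses half the machines V1 used: twice as efficient.
print(relative_efficiency(10, 5))
# V2 needs 20 machines where V1 needed 10: half as efficient.
print(relative_efficiency(10, 20))
```

Note that the result only means something as a comparison between versions, which is the point of the definition: a V1 on its own has nothing to be compared against.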

Once again — for all data center people out there, I’m oversimplifying for the sake of this particular argument.

In the paragraph quoted from the Times above there’s a bit of a jumble of terms. It starts talking about “energy efficiency” (which, in our definition, would be 100%-Utilization%) and then it says “they were using only 6 percent to 12 percent” which is straight utilization%. I think playing fast and loose with terminology like that can get in the way of really knowing what we’re talking about, which is why I’ve spent some time defining, at least, what I’m talking about.

INSERT OBVIOUS TRANSITIONAL SECTION TITLE HERE

Ok, fine, the squirrel in charge of coming up with section titles is on a break and I can’t think of a good one, so let’s just keep going now that we’re armed with these terms. I’ve repeatedly stated that the average utilization, by itself, is meaningless. Allow me to elaborate on one key reason why. Suppose you measure ten systems and get the following utilization values: 10, 10, 10, 10, 10, 90, 90, 90, 90, 90. This gives you an average utilization of 50% — already something to note, since the average clearly isn’t telling the whole story. Assuming for a moment that these values are comparable across systems (a big if, but again: simplify!), the average is really hiding something important: five systems have “bad” utilization (10%) while the other five have “good” utilization (90%).
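To make the point about averages concrete, the sketch below compares the ten values above with a hypothetical second set of ten systems that all sit at 50% (that second set is an assumption added purely for contrast). Both have the same mean; only the standard deviation tells them apart.

```python
from statistics import mean, pstdev

# The ten measurements from the example above.
bimodal = [10, 10, 10, 10, 10, 90, 90, 90, 90, 90]
# Hypothetical contrast case: every system at exactly 50%.
uniform = [50] * 10

for name, values in (("bimodal", bimodal), ("uniform", uniform)):
    # pstdev is the population standard deviation: we measured every system.
    print(f"{name}: mean={mean(values):.1f}% stdev={pstdev(values):.1f}")
```

The two sets are indistinguishable by their mean, yet the first has a standard deviation of 40 points and the second has none.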

But wait, why am I using quotes around “good” and “bad”? Because this is something the article implies: that more utilization is better, but this is exactly why talking about efficiency matters. Maybe, just maybe, the five systems at 10% are actually systems with software that do the same thing but more efficiently, and from that perspective our assessment of good or bad can get inverted.

Suppose the five systems at 90% are part of a cluster. Suppose one of the five systems crashes. Suddenly the load that went to that system has to be distributed to all the others, which quickly puts the remaining four systems at over 100%, and the entire cluster could crash. The load would never get distributed perfectly across systems, among other problems. In reality we’d probably be looking at a cascade of failures as each individual remaining system crosses the 100% threshold, putting more load on others, and so forth, ending with the final state where the entire cluster is down and everyone, from the CEO to your users, is screaming bloody murder.

Suddenly we’re looking at those systems at 90% with some suspicion, no? In fact if you keep on the simplification vibe, you could argue that for a five-system cluster the only way to protect it from any one machine going down (assuming a crash at 100% load, also not a given…) is to maintain the average utilization at around 75% or so, which in the event of one system crashing would leave the remaining four at 93.75%. 80% average would mean one system crashing leaves everyone at 100%. So the difference between 75% and 80%, which seems minimal, is the difference between life and death in this scenario.
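The arithmetic behind those numbers can be sketched as follows, under the same simplifications used in the text: load is spread perfectly evenly across the surviving machines, and a machine falls over at 100%. The function names are made up for this example.

```python
def utilization_after_failures(avg_utilization: float, machines: int, failed: int = 1) -> float:
    """Per-machine utilization once the failed machines' load is redistributed.

    Assumes perfectly even redistribution, which real clusters never achieve.
    """
    survivors = machines - failed
    if survivors <= 0:
        raise ValueError("no machines left to take the load")
    return avg_utilization * machines / survivors


def max_safe_utilization(machines: int, failed: int = 1) -> float:
    """Highest average utilization that keeps survivors at or below 100%."""
    if failed >= machines:
        raise ValueError("no machines left to take the load")
    return 100.0 * (machines - failed) / machines


for avg in (75, 80, 90):
    print(f"{avg}% average -> {utilization_after_failures(avg, machines=5):.2f}% after one crash")

print(f"ceiling for a five-machine cluster: {max_safe_utilization(5):.0f}%")
```

For five machines the ceiling is 80%, which is exactly the break-even case; anything above it, like the 90% cluster, ends up over 100% the moment one machine goes away.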

But I said can get inverted because there’s other factors at play. Take just one: uptime requirements. Suppose you’re somehow OK with the idea that one system crashing takes everything down, and you’re willing to trade uptime for costs (i.e., using only five machines). Then you’d be ok, probably. Everyone I know, however, would want to protect from this eventuality.

This scenario isn’t contrived to get the answer to come up the way I want. It is typical, certainly far, far more common than a simple and straightforward DC setup where components get swapped instantly, you’re always at maximum efficiency, and every system does the same thing. Reality isn’t like that. Speaking of which…

A (SMALL) DOSE OF REALITY

Let’s complicate things a bit further, or rather, make them just slightly more realistic. Let’s say that your 90% utilization was measured during a time where you had 100 simultaneous users on average. Tomorrow, though, Pando Daily posts a glowing review of your website and usage doubles. Doesn’t sound too far-fetched. Now what? The 90% utilization, if there’s a strong correlation between simultaneous users and load (which is typical), is suddenly guaranteed to bring you down. Suddenly you’d need much lower baseline utilization to be able to handle that spike, and the 90% looks like a bad idea once again.
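A small sketch of that spike, assuming load scales linearly with simultaneous users (the "strong correlation" case; real systems are rarely this tidy). The function names are hypothetical.

```python
def utilization_at(users: int, baseline_users: int, baseline_utilization: float) -> float:
    """Projected utilization, assuming load grows linearly with simultaneous users."""
    if baseline_users <= 0:
        raise ValueError("baseline_users must be positive")
    return baseline_utilization * users / baseline_users


def max_baseline_for_spike(spike_factor: float, ceiling: float = 100.0) -> float:
    """Highest everyday utilization that still survives a spike of the given size."""
    if spike_factor <= 0:
        raise ValueError("spike_factor must be positive")
    return ceiling / spike_factor


# 90% at 100 users, then usage doubles to 200 users.
print(utilization_at(200, baseline_users=100, baseline_utilization=90))
# To survive a doubling, everyday utilization has to stay at or below this.
print(max_baseline_for_spike(2))
```

Under this assumption, a service that wants to survive a doubling cannot run above 50% on a normal day, before even accounting for a machine failing during the spike.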

Going a bit further, imagine the two clusters of five systems are actually performing the same function, but one of them is a V1 (90%) and the other one is the more efficient version you just deployed (10%). This is something that happens all the time. Once you can instrument a piece of software running under real-world load, you can understand better what leads to that load, you can optimize your software, and massive jumps in efficiency are common. So when the 90% is deemed a problem, the team gets to work, and they come up with a V2 that is around an order of magnitude more efficient.

Which leaves you with two possible versions of the software doing the same work, but one uses 10% of the systems and the other 90%. Which one is better? I think everyone would agree that the newer, more efficient system is better, even though if you deployed it, it would suddenly make your utilization plummet across the board, which, according to the article, is “bad”. Oops.

Hold on, I hear you say. Now that you’ve got software that is more efficient why don’t you just decommission the extra systems you don’t need? Then you’d be using less power overall and you could cut down, say, from 10 machines at 90% to 5 machines at 20%, which should give you a nice margin for error.

Aha! This surely sounds true, but this is also where the oversimplifications I keep bashing get tricky, since the real world gets in the way. Two interrelated points. First, there aren’t that many (if any) tidy switchovers from V1 to V2. In increasing efficiency you may have introduced bugs. To protect against that, you start having to test (more machines for that!), and even when testing says it’s ok to deploy you will start small: deploy to a few machines only, wait, verify. Deploy a bit more. Wait, verify.

THE MISSING VARIABLE: TIME

The process we just described has an important variable that we haven’t looked at so far: time. As in, in the process of doing this, time passes.

Sounds obvious right? But if this is true, it’s also true that as time passes, requirements change. Requirements, here, encapsulating all the factors, external and internal, that go into delivering your service or product. Whatever you’re doing, you’re not dealing with a static entity, but something that is constantly evolving, both because you’re changing it from inside as you update the software, fix bugs, evolve architecture, and deploy new hardware, but also because the external factors are constantly in flux. Likely, you will now have more people using the systems (or, for things that are just APIs, maybe more machines). Or load may have changed due to a feature.

The passage of time here is the critical element missing, and it gets in the way of the ideal scenario in which we allow the system to truly “contract” and use fewer resources in terms of power. By the time you’re sure you can decommission those extra machines, it’s quite possible that you now have other uses for them, and even if you did find a sliver of time to do this you may not have the option since system growth, or feature changes, or whatever, may be telling you that you will need those extra machines in a week or a month, and therefore taking them down would be simply a waste of time. At Ning, for example, our overall hardware footprint remained largely stable in terms of number of machines, and therefore total power consumed, even as we went from 1 MM registered users to 50 MM and beyond. Setting aside the fact that this took an enormous amount of work and constant vigilance, it also meant that the system that could handle 50 MM users with the same amount of hardware as for 1 MM was very different than the original. And throughout that process, many machines would oscillate in their degree of utilization, from the low teens to the high 70s. Over a period of a few years you could take measurements at different points in time that would make us appear either like geniuses or criminally stupid, if all you looked at was utilization, and if “less utilization is bad” was your only guiding principle.

If we can agree that low or high utilization alone is meaningless, and that your utilization will fluctuate, maybe drastically, as you optimize, we can start to ask more appropriate questions. For example: can we do better in terms of releasing idle capacity as efficiency increases? Absolutely, and the widespread use of virtualization in recent years also means that it’s now far easier to have capacity fluctuate along with load. The APIs for system management popularized by public cloud infrastructure companies (e.g. EC2, Rackspace cloud, etc.) have led in the last couple of years to more and more services where capacity is instantiated on demand according to load, leading to a more efficient use of those virtualized resources.
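The idea of capacity following load can be sketched as a simple sizing rule. This is an illustration only, not any provider's API: the function name, the "machine-percent" unit for load, and the 75% target are all assumptions made for the example.

```python
import math


def instances_needed(total_load: float, target_utilization: float = 75.0, min_instances: int = 1) -> int:
    """Number of instances that keeps average utilization at or below the target.

    total_load is in 'machine-percent': 450 means 4.5 fully busy machines.
    """
    if not 0 < target_utilization <= 100:
        raise ValueError("target_utilization must be in (0, 100]")
    return max(min_instances, math.ceil(total_load / target_utilization))


# Hypothetical load samples taken over a day, in machine-percent.
for load in (100, 450, 900, 300):
    print(f"load={load}: run {instances_needed(load)} instances")
```

The target below 100% is the headroom discussed earlier; an on-demand setup mostly moves the question of who keeps that idle margin around.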

Even there, though, we have a problem, for if EC2 allows you to instantiate and tear down a hundred AMIs without giving it a second thought, it’s also necessarily true that there must be an actual hardware footprint doing nothing but waiting around for that to happen. Having people constantly disconnect and reconnect machines is just not feasible, and in almost every case it will involve as much waste as taking a system down, since in the world we live in the capacity we need to run the ever-increasing complexity of Internet infrastructure keeps going up, which means that whatever you stop using today, you’re guaranteed to need tomorrow. With 60% or more of humans on the planet still not online, there’s clearly still a lot of growth left. For the biggest services, it’s common to have to constantly be deploying new capacity just to account for growth.

This leads us to yet another question, perhaps the last one in this particular chain of thought: Should we in general accept less reliability from online services given that it has real environmental impact, and real cost?

My own answer to this would be a clear NO. I don’t think you can have these systems simultaneously be part of the fabric of society (a point I’ve made before) and have them be “partially reliable,” just like there’s no way to be “partially pregnant.” Reliability is intrinsically tied to the usefulness of these services. Perhaps there are ways in which we can bake in more asynchronous behavior in some cases, but when a lot of what systems do is real-time, 24/7/365 and worldwide, this isn’t something we’ll be able to exploit frequently. We have crossed the Rubicon, so to speak, and have to see this through.

THE CONCLUSION, OR, BELATEDLY MAKING SENSE OF THE TITLE OF THIS POST

Utilization in data centers is an important issue, but talking about it bereft of context is not really that useful. In particular, talking about it without also talking about efficiency, and all the parameters that go into it including what kinds of applications are running, what the goals are, what the requirements are, etc., is going to leave us with nothing but incomplete answers, and here incomplete will leave us way too close to incorrect for comfort.

And this isn’t just about the Internet. Is it a valid question to talk about utilization in, say, TV stations? Or other major sources of media for that matter? Pre-digital music distribution…? Sure it is, but it’s not something we focus on because the use of energy in other media is one or two steps removed from what we see, so it’s easier to ignore even if it’s there all the same. When was the last time you remember a TV station couldn’t broadcast? Are we to believe that they never had a power failure? No. They have backup systems. Those evil-sounding lead batteries or diesel generators.

Context.

By looking at context and simply shifting to the assumption that Internet infrastructure is actually run fairly well, given all the requirements and its rapid evolution, we realize that what we really should be wondering about is what makes us build systems this way, rather than assuming they are not built properly. Should we talk about utilization? Or is this really about what drives it, and therefore utilization is part of the discussion but not the central point?

Channeling Reverend Lovejoy for a moment, we could say, then: “Short answer yes with an if, long answer, no, with a but…”

santa claus conquers the martians

Part 2 of a series (Part 1, Part 3)

PRELUDE

I’ve had a busy week, and have been trying to sit down and put together a followup to my response to the NYT’s article on data centers.

I write the title, and as soon as I do, my mind goes blank. I read the title again. What the hell was I thinking? I am looking at the screen, white space extends below the blinking cursor, mirrored by something somehow stuck in my head, alternating on/off, rumbling lowly like an idling engine: I swear I had a point.

So naturally I start to think that this, perhaps, should be the new title. Which, in the expected recursion path that would follow, naturally ends up in another meta-commentary paragraph (also with a simile close to its ending), which I decide not to write. Recursion upwards, probably to conform with an implicit image of happiness we may or may not feel (or that in this case is really quite unwarranted and even more, even worse: unnecessary) but we should generally imply anyway, because these days if you’re not explicitly happy something must be wrong, and therefore it must be fixed. Neutral has become a bad state to be in, apparently, long after being “with us or against us” became a common way to think about nearly everything. No, recursion has no direction except, perhaps, into itself, but it now occurs to me that years of looking at function call stacks have trained me (hopelessly comes to mind, but that’s also not happy) to think of recursion as up or down, rather than, say, horizontally from right to left.

Fascinating, I know.

– oOo –

I will eventually get to Santa Claus and the Martians, but for the moment, back to the article.

The series was titled “The Cloud Factories”, and right there it broadcast ever-so-subtly that it was to be something intended to get worked up about.

“Factory” can mean “the seat of some kind of production” but in this case the weight of the word is in the manufacturing angle. This doesn’t quite feel right, though. A factory is where things are built, sequentially, or at least mostly sequentially, and a cloud is anything but built, and the process is anything but sequential. A cloud emerges, and if we switch to the definite article and the proper noun with all its implications and uppercaseness, it’s also true that The Cloud is an emergent phenomenon. Metaphors are often misapplied, can be incorrect, but it’s not that often that a metaphor involving an overloaded term (“cloud”) is both misapplied and incorrect in the exact same way for nearly all the meanings of the term. This takes some skill.

So, yeah, the point of the title of the series was not to be accurate as an analogy, but to evoke. Specifically, an image. Much like the factory in which they make Itchy & Scratchy cartoons in The Simpsons has chimneys and dark dense smoke coming out of them, as does every factory in The Simpsons, regardless of what it’s for. The “factories” in “The Cloud Factories” seem to intentionally or not (but can this really be unintentional?) transmit the idea of dirt we associate at a reptilian level with “factory”. Dirt. Pollution. Guilt by association. Then — the title of the article, the first of two so far, drops the subtle imagery: “Power, Pollution and the Internet.” Strangely enough, beyond the title the word “pollution” appears exactly once in the entire article.

Pollution and the Internet. How could one not react to that? What I wrote a week ago was pure reaction, if nothing else to the reactionary tone of the article, but by now I have accumulated enough in my head to maybe add something else to this topic, which, perhaps predictably, has a bit less to do with the contents of the article itself (not that that topic is exhausted by any means) but on what is one possible way to look at its main thrust through the lens of discourse on technology nowadays, how we use metaphors and analogies to convey something that we haven’t yet internalized, and the factors at play in sustaining a reasonable and reasonably deep conversation in an environment that doesn’t lend itself to that. And if all of this in retrospect looks obvious, consider this the admittedly convoluted way in which I am creating a reminder, a mental note: something to pay more attention to.

On to it, then.

ACTION REACTION RETRACTION

Action — argument (paraphrasing, summarizing): “That which powers our online services and more generally the Internet is really a hidden pollution machine run by people fearful of reducing waste, even though the means to do so are readily available.”

Reaction — counterargument (now really summarizing): “Not true.”

That the argument isn’t true may be indeed true, and yet to not just agree with the counterargument because, for example, you respect whoever made it but to understand it requires a degree of experience and training and knowledge that is well beyond what most people could get to because, quite simply, they have their own jobs and lives. Indeed, if it’s not your job and it’s not your life (and for most of the people for whom this is a job, it’s also our life), you really shouldn’t bother. The modern world, and to some degree the very basis of our progress is that we use things that we can’t build, and in many cases can’t even understand. We travel by plane even though many people have no idea how it works, let alone are able to build one.

And that is just fine.

We trust the plane, though, don’t we? Well, now we do, but 150 years ago the thought that you could pack tons and tons of baggage and instruments and hundreds of people into a tin can and by pushing air at unimaginable speed through smaller tin cans attached to the larger tin can with bolts you would get the thing to fly was unpopular indeed.

Bear with me for a minute here. I’m getting somewhere. Promise.

As I was writing a week ago I was typing frantically and in the process of switching windows I entered “action reaction retraction” into Google, and the last result visible before I had to scroll said “Robert H. Goddard. The New York Times.” which seemed intriguing enough, and following there were notes on a retraction that seemed almost too appropriate. Really? was the thought, so I went to the Times archives and found the quote, but in the process lost the bizarre way in which I stumbled on to it. I spent almost an hour yesterday, I kid you not, going through the browser’s history to see what I’d done, and I still can’t remember why I was typing that except to think that I must have read this before, and further googling just for the quote shows that it’s been mentioned a few times in the last several years. Sarah Lacy included the quote in her followup, along with her own thoughts regarding an earlier Times story on Tesla motors which shows if not a pattern at least some concordance of mistakes all going in the same direction, or misdirection.

The quote was a retraction from the Times in which it acknowledges:

“Further investigation and experimentation have confirmed the findings of Isaac Newton in the 17th century and it is now definitely established that a rocket can function in a vacuum as well as in an atmosphere. The Times regrets the error.”

This was triggered by Apollo 11’s flight, when, one presumes, a 50-year-old takedown of rocket pioneer Robert Goddard on the very pages of the Times might have come to their attention:

“That Professor Goddard, with his ‘chair’ in Clark College and the countenancing of the Smithsonian Institution, does not know the relation of action to reaction, and the need to have something better than a vacuum against which to react — to say that would be absurd. Of course he only seems to lack the knowledge ladled out daily in high schools.”

The Times regrets the error. This reminded me of what we could call the case of Catholic Church v. Galileo. At least the Vatican actually apologized to Galileo directly, although in fairness to the Times, it took the Vatican closer to 400 years to get to that point.

The reason I bring up the quote again is that there’s a certain tone of mischief detectable in it, since no one can possibly believe that they are seriously a) realizing just now that rockets actually work in a vacuum and b) that the way to correct for this is to say that “this confirms the findings of Isaac Newton.” Points for whoever wrote it: it was funny.

And just to be clear: this isn’t about giving a pass to the Times, but to try to figure out why this seems to be a recurring problem from which the Times seems far from exempt, even when we may be inclined to think they are exempt from it.

The question is, then, why would they, the nebulous they that nevertheless is actually people, talented as they may be, have originally thought that trashing Goddard, someone with enough credentials to presumably give him the benefit of the doubt at the very least, was a good idea?

Perhaps because in doing that they were reflecting, ahem, the times — the prevailing sense of what was or wasn’t possible in the age. The “truth” as they saw it, because truth and facts are two different things. To top it off, in this particular case a giant rocket traveling at some 11,000 meters per second was, as an undeniable fact, still very much in the future, but when the rocket was actually up there, actually carrying three people and countless gizmos and measuring devices and chemicals of all kinds, you didn’t have to know anything about physics to realize that there was something to this seemingly crazy idea of rockets in space after all.

Back to trusting airplanes at last: We trust the plane because we see it. We feel, down to our bones, the effort of the engines as it takes off and lands. If someone started to argue that the typical turbine was somewhat wasteful, I don’t think I’d be alone in thinking Well, while I’m inside the plane and in the air, I’d prefer a little waste to not being, you know, alive.

So is there something to the idea that, in the popular imagination, not seeing is disbelieving, to invert the well-known dictum?

More importantly, given the complexity and sheer scale of the systems involved in running the Internet, what would it take to “see” when what we’re talking about can’t, ever, actually be seen?

…AND YOU ALWAYS FEAR… WHAT YOU DON’T UNDERSTAND…

That’s a line from Batman Begins uttered by Mafia mob boss Carmine Falcone while he is explaining to a young Bruce Wayne why he should just stop acting all flustered about crime and go home. It’s a critical line not only in the film but in the overall story arc of the trilogy, since within it we find Bruce Wayne’s drive to become Batman. Bruce agrees with Falcone’s thesis but not his solution, decides to understand, disappears into the underworld, then returns, seven years later, as Batman.

Understanding — not fearing — takes knowledge, and knowledge takes a long time and effort to develop.

Convincing people that rockets could fly in space “only” required that we actually fly a rocket in space. What would be the equivalent for getting people to accept that how data centers work is not some perennial waste, where secret gerbils run mindlessly within wheels, most of the time doing nothing at all, wasting energy and in the process laying waste to the planet as well? Well, one way would surely be to get everyone to spend the equivalent of Wayne’s “seven years in the underworld,” which in this case would mean not only getting a degree in computer science but also spending a good amount of time down in the trenches, seeing firsthand how these things are actually run.

That this is an impractical solution, since we can’t have the whole planet get a CS degree or work in a data center, is obvious. It leaves us with the alternative of using analogies and metaphors to express what people still haven’t internalized, and probably will never be able to internalize, in the way that they have the concept of a rocket or an airplane. Before planes flew, the idea of them also had to be wrapped in analogies and metaphors, usually involving birds. The concept of a factory would have undoubtedly required some heavy analogies to be explained to people in, say, the 16th century. We grasp at something that is known to make the unknown intelligible.

The analogies we choose matter, however. A lot. Which is why I keep talking about planes not factories. A modern commercial jet is a much more apt analogy for the type of “waste” involved in running a modern data center.

There is waste and pollution involved in running a jet, as anyone can plainly see. Sometimes the waste is obvious (empty seats), sometimes it’s not (unnecessary circuitry), but generally people don’t doubt that the good people at Boeing et al. are always doing their damnedest to make the plane as efficient, safe, and effective as possible. The same is true of Internet infrastructure.

WHERE WE FINALLY GET BACK TO SANTA CLAUS CONQUERING MARTIANS

You may or may not agree with the plane analogy, there may be better ones, there are more things to discuss and there certainly is a need for us in the industry to engage more broadly and try to explain what’s going on as long as everyone in the world doesn’t have a CS degree (a man can dream).

So for all the faults I could find with the article, I think it was good that it triggered the conversation, and herein lies our second conundrum.

This “conversation” — it will require effort to be carried out.

A brief detour: reading Days of Rage, a commentary in the latest issue of The New Yorker, which references Santa Claus Conquers The Martians while talking about the “Muslim Rage” of recent days over a YouTube video no one had actually seen, certainly not before the protests. I agree with a lot of the article, except on one point:

“The uproar over “Innocence of Muslims” matters not because of the deep pathologies it has supposedly laid bare but because of the way the film went viral.”

Psy and Gangnam Style was viral. This video wasn’t. If anything, from what we know, it seems to be quite the opposite of viral, since apparently it was simply an excuse used by people in power to rile up the unhappy (there’s that word again) masses so they could have something to do: “Angry? Unemployed? Bored? Feel you have no future? Here, go burn an embassy.” And how irrationally angry you have to be to somehow find that looting and burning and killing either solves a problem or makes up for anything or is even, just, a remotely justified way to react. How displaced you have to be from yourself and disconnected from what surrounds you. I can hypothesize, only. At points in my life I’ve had little or no money but never felt in a way that would ever lead me to react in that way. Not that this is about money, I know, it’s just one of the factors (probably), but one that I can try to relate through. But I digress.

SCCTM is indeed an actual movie and the reason I bring it up is that I had seen it years ago in an MST3K episode, and when remembering that it occurred to me that what happened in the Middle East was a more, perhaps the most, extreme version of a pervasive phenomenon, that of reacting to what our perception is of something rather than to the thing itself.

Mind you, this isn’t one of those “things were better in my time” type of arguments. While there was a time decades ago when in-depth roundtables in media were more common fare, this happened in an environment in which the amount of raw data to process was far, far less than it is now. We are overwhelmed by data but lacking in information. This isn’t a matter of access to technology, either. I’d bet a lot of the people doing the burning and killing in Benghazi had cellphones. We all do.

This, deep in the weeds of this post (essay?), is what triggered the topic in my head. The end of the chain of associations: that what we’re often doing these days to handle all the information that we’re exposed to would be tantamount to MST3K dispensing with the actual viewing of the movie and simply skipping to the part where we make fun of it. It wouldn’t be the same, would it? Context is critical, but we react in soundbites and generate storms of controversy over a few words which can’t possibly have context attached, because there’s simply no space for it, anywhere.

Twitter and to some degree Facebook are often blamed, unfairly I think, for a supposed devolution of our society into people trapping their thoughts into contextless cages 140 characters in size. I don’t think there’s any question, though, that we humans are and have always been lazy if we can get away with it, and that the deluge of information leaves us with little time to reflect on it, so the mind recoils and defends itself with quips and short bursts, and Twitter (and Facebook) are a good mechanism for that. It just so happens that this constant jumping around topics superficially is both a) effective as a dopamine release mechanism –read: addictive– and b) the perfect way of thinking of yourself as informed and on top of everything and yet truly involved in nothing. Why isn’t Twitter or Facebook to blame, then? Let me give you a Twitterless example: sad advertisement on TV, people starving, a catastrophe somewhere. Text a number and give $3. Done. Back to watching Jersey Shore, or 60 Minutes, or whatever.

Twitter, Facebook, all of them, are not the proximate cause. They are an effect. A reaction.

The environment we live in has fundamentally changed because there is readily available, quite simply, more data about everything, a large part of which is a barrage of trivia and gossip — which is to be expected since they are, ahem, trivial to generate. If Lindsay Lohan having a traffic accident is enough to generate massive news coverage and the cascade of reaction that follows, topics that are deeper and more complex and are more difficult to grasp will find it hard to compete.

It’s something new, or relatively new in historical terms, and I don’t think we know how to handle this deluge yet. We are drinking from a seemingly limitless flood of information but we haven’t yet figured out how to close the faucet every once in a while. We don’t necessarily drown in it but this flood that is constantly rushing around us leaves us with no time to reflect on any one point.

Information overload! Pfft. This isn’t a new idea! I bring it up not only because I think that we are increasingly using (creating) media that is suited to how we are trying to deal with it, and the edifice we construct with all of it is not well-optimized to transmit complex ideas (this, also, is not at all original), and so it seems critical that we have to work hard at finding the right metaphors and analogies, the right tools to talk about how the machinery of the Internet works. Tools and machinery, here, somewhat ironically encapsulating the point.

AND NOW FOR THE SURPRISINGLY SUCCINCT CONCLUSION

Analogies matter, metaphors matter, and we need to find better ones to talk about what the Internet is (for example, a “global village” it is not, and this term has luckily fallen by the wayside, but the many reasons why will have to wait for another time). We also have to contend with a shifting media environment in which a conversation like this can get all too easily lost in the noise, not because, as a cynical interpretation would have it, people only care about Snooki or the Kardashians or whatever, but because until we figure out how to live and engage with complexity when soaking in data there will only be surface and precious little depth.

And if there’s an additional meta-point to “Power, Pollution and the Internet,” something else that is important beyond the specifics in the article, it is that we as an industry have left a void that can be filled with anything, and if we don’t engage and try to make what we do more comprehensible for everyone who, rightly, doesn’t have the time to understand it because they’re busy running the rest of the world, then we in the industry have no one to answer to for it but ourselves.

Part 2 of a series (Part 1, Part 3)

a shocking new way to get google maps on iOS 6

  • Step 1: Visit maps.google.com.
  • Step 2: (optional) save shortcut to homescreen.

Hm.

PS: Yes, defaults matter, but the native app was never that much better than the web app.

a lot of lead bullets: a response to the new york times article on data center efficiency

Part 1 of a series (Part 2, Part 3)

Note: This is a 5,000-word post (!) in response to a 5,000-word article, since I thought it necessary to go beyond the usual “that’s wrong”. A detailed argument requires a detailed counter-argument, but, still, apologies for the length.

As I was reading this New York Times article on data centers and power use, there was mainly one word stretching, doppler-like, through my head: “Nooooooo!”

Not because the article exposed some secret that everyone that’s worked on websites at scale knows and this intrepid reporter was blowing the lid on our quasi-masonic-illuminati conspiracy. Not because there was information in it that was in any way shocking.

The reason I was yelling in my head was that I could see, clear as day, how people who don’t know what’s involved in running large scale websites would take this article. Just look at the comments section.

The assertions made in it essentially paint our engineers and operations people as a bunch of idiots who are putting together rows and rows of boxes on data centers and not caring what this costs to their businesses, nay, to the planet.

And nothing could be further from the truth.

There is one thing that the article covers that is absolutely true: data centers consume a hell of a lot of power. Sadly, the rest is a mix of half-guesses, contradictions, and flat-out incorrect information that creates all the wrong impressions, misinforms, and misrepresents the efforts and challenges that the people running these systems face every day. In the process, the article manages to talk to precious few people that are really in a position to know and explain what’s going on. In fact, when I say precious few, I mean one: Jeff Rothschild, from Facebook, and instead of asking him questions on data centers the article just uses one amusing anecdote from Facebook’s early days that makes eng/ops look like a bunch of monkeys running around with fans.

This isn’t just an incredibly inaccurate representation of the dedication and hard work of eng/ops everywhere in the computer industry, I know for a fact it’s also inaccurate with regard to Facebook itself. I imagine Facebook engineers (and those of any other website, really) reading this article, thinking about the times they’ve been woken up in the middle of the night to solve problems that no one has ever faced before, for which no one has trained them, because no university course and no amount of research prepares you for the challenges of running a service at high scale, and having to solve all that as fast as possible, regardless of whether it’s about making sure that someone can run their business, do their taxes, or that a kid halfway around the world can upload their video of a cat playing the piano.

Before I continue, let me say that, even if I am (clearly) a bit miffed, I respect the efforts of the reporter, even with the inadequate sources that the story quotes, and there is an important story here but it’s not the one he focused on. The question is not that there’s inefficiency, or rather, under-utilization of power. There’s some but not as much (the main figure the article quotes, that data centers, or DCs for short, run at 6-12% utilization, is just completely made up by consultants and I can’t possibly imagine how they arrived at it).

The question is why.

Sure, there’s some inefficiency. But why? There are many reasons, but before I get to them, let me spend some time on the problems in the article itself beyond this central issue.

The problems in the article

Let me go through the biggest issues in the article and debunk them. To begin:

Energy efficiency varies widely from company to company. But at the request of The Times, the consulting firm McKinsey & Company analyzed energy use by data centers and found that, on average, they were using only 6 percent to 12 percent of the electricity powering their servers to perform computations. The rest was essentially used to keep servers idling and ready in case of a surge in activity that could slow or crash their operations.

First off, an “average,” as any statistician will tell you, is a fairly meaningless number if you don’t include other values of the population (starting with the standard deviation). Not to mention that this kind of “explosive” claim should be backed up with a description of how the study was made. The only thing mentioned about the methodology is that they “sampled about 20,000 servers in about 70 large data centers spanning the commercial gamut: drug companies, military contractors, banks, media companies and government agencies.” Here’s the thing: Google alone has more than a million servers. Facebook, too, probably. Amazon, as well. They all do wildly different things with their servers, so extrapolating from “drug companies, military contractors, banks, media companies, and government agencies” to Google, or Facebook, or Amazon, is just not possible on the basis of just 20,000 servers on 70 data centers.
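A quick way to see why an average alone says so little: the sketch below, in Python, builds two fleets with the same mean utilization but very different spreads. The utilization figures are invented for illustration and have nothing to do with the McKinsey sample.

```python
from statistics import mean, pstdev

# Hypothetical utilization percentages for two five-machine fleets.
fleet_a = [9, 9, 9, 9, 9]      # every machine lightly used
fleet_b = [1, 1, 1, 1, 41]     # four nearly idle machines and one busy one

for name, fleet in (("A", fleet_a), ("B", fleet_b)):
    # Same mean for both fleets, but the standard deviation tells them apart.
    print(name, "mean:", mean(fleet), "std dev:", pstdev(fleet))
```

Both fleets report an average of 9 percent, yet they describe completely different operations, which is the kind of detail a lone "average" hides.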

Not possible, that’s right. It would have been impossible (and people that know me know that I don’t use this word lightly) for McKinsey & Co. to do even a remotely accurate analysis of data center usage for the industry to create any kind of meaningful “average”. Why? Not only because gathering this data and analyzing it would have required many of the top minds in data center scaling (and they are not working at McKinsey), not only because Google, Facebook, Amazon, Apple, would have not given McKinsey this information, not only because the information, even if it was given to McKinsey, would have been in wildly different scales and contexts, which is an important point.

Even if you get past all of these seemingly insurmountable problems through an act of sheer magic, you end up with another problem altogether: server power is not just about “performing computations”. If you want to simplify a bit, there are at least four main axes you could consider for scaling: computation proper (e.g. adding 2+2), storage (e.g. saving “4” to disk, or reading it from disk), networking (e.g. sending the “4” from one computer to the next) and memory usage (e.g. storing the “4” in RAM). This is an over-simplification because today you could, for example, split up “storage” into “flash-based” and “magnetic” storage since they are so different in their characteristics and power consumption, just like we separate RAM from persistent storage, but we’ll leave it at four. Anyway, these four parameters lead to different load profiles for different systems.

The load profile of a system can tell you its primary function in abstract terms. Machines in data centers are not used homogeneously. Clusters of them may be primarily used for computation, other clusters for storage, others have a mixed load. For example, a SQL database cluster will generally use all four heavily, while a cluster that serves as a memory cache would only use RAM and network heavily. As an aside, network is important in terms of power consumption not in and of itself, since running a network card is a rounding error in terms of power consumption, but because to run a big network infrastructure requires switches in heavily redundant configurations that actually do count, not to mention the fact that a bigger network equals more complexity that has to be managed, but we’ll get to that in more detail later.
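As a rough illustration of what a load profile looks like along those four axes, here is a small Python sketch. The `LoadProfile` type, the 0.5 threshold and every number in it are hypothetical, made up only to mirror the SQL cluster and memory cache examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoadProfile:
    """Fraction of capacity in use on each axis, from 0.0 to 1.0 (illustrative)."""
    compute: float
    storage: float
    network: float
    memory: float

    def heavy_axes(self, threshold=0.5):
        # List the axes this cluster leans on, in declaration order.
        return [axis for axis, value in vars(self).items() if value >= threshold]

# Invented numbers: a SQL cluster uses everything, a cache mostly RAM and network.
sql_cluster = LoadProfile(compute=0.7, storage=0.8, network=0.6, memory=0.75)
memory_cache = LoadProfile(compute=0.1, storage=0.05, network=0.7, memory=0.9)

print("sql:", sql_cluster.heavy_axes())
print("cache:", memory_cache.heavy_axes())
```

Judged only by its compute figure, the cache cluster in this sketch would look almost idle even though it is doing exactly the job it was built for.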

I don’t doubt that McKinsey got a lot of numbers from a lot of companies and then mashed them together to create an average. I’m just saying that a) it’s impossible to have the numbers measure the exact same thing across so many different companies that are not coordinating with each other, and b) that the average is therefore of limited value since the load profile of all of these services is so wildly different that if you don’t have accurate data for, say, Google, your result is meaningless.

1. Right from the start, one of the primary legs on which the article stands is either incorrect, or meaningless, or both.

Moving on.

A server is a sort of bulked-up desktop computer, minus a screen and keyboard, that contains chips to process data.

No, that’s not what a server is. It is not a “bulked-up desktop”. In fact, the vast majority of servers these days are probably as powerful as a typical laptop, minus battery, display, graphics card and such. And in many cases the physical “server” doesn’t even exist since everyone doing web at scale makes extensive use of virtualization, either by virtualizing at the OS level and running multiple virtual machines (in which case, yes, perhaps that one machine is bigger than a desktop, but it runs several actual server processes in it) or distributing the processing and storage at a more fine-grained level (MapReduce would be an example of this). There’s no longer a 1-1 correlation between “server” and “machine,” and, increasingly, “servers” are being replaced by services.

The reason this seemingly minor thing matters is that it creates an image that a datacenter is readily comprehensible. Just a lot of “bulked up desktops.” I understand the allure of this analogy, but it’s not true, and it creates the perception that scaling up or down is just a matter of adding or removing these boxes. Sounds easy, right? Just use them more efficiently!

2. Servers are not “bulked up desktops” and this is critical because infrastructure isn’t just a lot of bricks that you pile on top of each other. There’s no neat dividing line. You can’t just say “use less servers” to increase efficiency, which is what the incorrect analogy leads to.

Next!

“This is an industry dirty secret, and no one wants to be the first to say mea culpa,” said a senior industry executive who asked not to be identified to protect his company’s reputation. “If we were a manufacturing industry, we’d be out of business straightaway.”

Say what? That infrastructure is inefficient is a dirty secret? First off, inefficiency is not a secret, and it’s not “dirty.” Maybe Google or Facebook don’t publish papers talking about their utilization, but the distance between not shouting this from the rooftops and this being a “dirty secret” is indeed long.

This statement, from an anonymous source no less, matters because it creates the sense in the article that the industry is operating in the shadows, trying to hide a problem that, “if only people knew,” would create a huge issue. Nothing could be further from the truth. There are hundreds of conferences and gatherings a year, open to the public, where the people who run services get together to discuss these problems, solutions, and advances. Everyone I know talks about how to make things better, and, without divulging company secrets, exchanges information on how to solve issues that we face. There are ACM and IEEE publications (to name just two organizations), papers, and conferences. The reason it appears hidden is that it’s just one of the many areas that only interest the people involved, primarily nerds. It’s arcane, perhaps, but not hidden.

3. This isn’t any kind of “industry dirty secret.” This statement only helps in making this appear to be part of some conspiracy which doesn’t exist and papers over the real issues by shifting attention to the supposed people who keep this “dirty secret.” That is, it seems to identify a group of people at which we can point our proverbial pitchforks, and nothing else.

Next!

Even running electricity at full throttle has not been enough to satisfy the industry. In addition to generators, most large data centers contain banks of huge, spinning flywheels or thousands of lead-acid batteries — many of them similar to automobile batteries — to power the computers in case of a grid failure as brief as a few hundredths of a second, an interruption that could crash the servers.

“It’s a waste,” said Dennis P. Symanski, a senior researcher at the Electric Power Research Institute, a nonprofit industry group. “It’s too many insurance policies.”

The first paragraph in this quote seems to imply that batteries (“lead-acid batteries” which sounds more ominous than just “batteries”) and “spinning flywheels” are there because the industry is “not satisfied with running electricity at full throttle”.

To begin — how do you run electricity at less than “full throttle”? Is there some kind of special law of physics that I’m not aware of that lets you use half an electron? If you need 2 amps at 110 volts, that’s what you need. If you need half, you use half. There’s no “full throttle.” A system draws the power it needs, no more, no less, to run its components at a particular point in time. Could you build machines that are more efficient in their use of power? Of course, and people are working on that constantly. Google, Facebook, Amazon, all spend hundreds of millions of dollars a year running data centers, and a big chunk of that is power. It’s a cost center that people are always trying to reduce.

Then there’s the money quote from Mr. Symanski, someone I’ve never heard of, saying that “It’s a waste. It’s too many insurance policies.”

Really? Let’s look at what happens when a data center goes offline. How many the-sky-is-falling articles have been written when significant services of the Internet are affected? A ton, of course, many of them in The New York Times itself. Too many insurance policies? Not true. Managing these systems is incredibly complex. Eng/ops people don’t deploy systems for the fun of it. We do this because it’s required, and, if anything, we do less than we know we should because there’s never enough money or people or time to deploy the right solution, so we make do with what we can and make up for the difference with a lot of hard work.

4. It’s not “too many insurance policies.” Redundant systems in data centers aren’t perfect by any stretch of the imagination, but the article strongly implies that this is being done willfully and flying in the face of evidence that says that’s unnecessary. This is flat-out not true.

These next two paragraphs are just incredible in how they distort information:

A few companies say they are using extensively re-engineered software and cooling systems to decrease wasted power. Among them are Facebook and Google, which also have redesigned their hardware. Still, according to recent disclosures, Google’s data centers consume nearly 300 million watts and Facebook’s about 60 million watts.

It quotes essentially unnamed sources at Facebook and Google (or their PR people) in saying that they are using software and cooling systems to decrease wasted power, and then it goes on to say that they still consume a lot of power, creating the impression that they are not really solving the problem. Here’s the thing though: if you built a more efficient rocket to go to the moon, it would still consume a lot of power. No one would be shocked to learn that, would they?

But there’s more!

Many of these solutions are readily available, but in a risk-averse industry, most companies have been reluctant to make wholesale change, according to industry experts.

This is probably the one paragraph in the article that sent my head spinning. It makes it appear as if there’s “solutions” that everyone knows about but no one uses because we’re a “risk-averse” industry.

What “solutions”? Who is “risk-averse”? Which companies are “reluctant to make wholesale change”? Who are the “industry experts” that assert this is the case?

This last paragraph is, quite simply, flat-out factually wrong. There are no “solutions” that are “readily available.” There simply aren’t. The article later cites some efforts around higher utilization, from companies working in the area, and implies that if everyone just did whatever these people are doing, everything would be better. And I say “whatever these people are doing” to capture the flavor of the article, since what they are doing is not magic; they just seem to be more efficient at queuing and batching processing. The problem is that not everything can be efficiently batched, certainly not when your usage isn’t dictated by your own schedule but by end users around the world, and each particular use case requires its own solutions. We’re not dealing with one problem but many, but I’ll get to that in a moment.

5. There are no “readily available” solutions to this problem, because there isn’t just one problem to solve. There are multiple overlapping challenges, all different and sometimes pulling in opposite directions (e.g. trading utilization for failover capability), and no one is “reluctant to make wholesale change”: these are huge, complex systems that can’t just be replaced with something else.

(And by the way, how ironic is it that an article about “an industry reluctant to make wholesale change” is being run by… a newspaper?)

Finally, just to wrap up since I’ve gone on way longer than I planned on the quotes and I don’t want to reprint the whole article, I want to touch on the contradictions. One giant contradiction in the article is that it talks throughout about how the “industry” (which by the way, is never defined clearly, although I’m assuming we are talking about the computer industry in general, since by now everyone pretty much uses data centers?) is “risk averse” while also covering the sheer scale of the infrastructure that exists. Most of this infrastructure has been built up over the last ten years. Google, Facebook, Amazon, and everyone else all have ramped up, by orders of magnitude, their operations over the last five years. None of these companies even existed 15 years ago! The social web and mobile have exploded in the last 3 years. There’s simply no way the industry can simultaneously build up this massive infrastructure that sustains exponential growth rates in traffic and usage and be so hugely risk averse. Maybe that will be true in a couple of decades. It isn’t true now, and whatever “expert” the reporter has talked to has not really been involved in running a modern data center. A lot of quotes in the article seem to come from people at “The Uptime Institute,” an organization I’ve never heard of, so maybe for a followup they should talk to the people who are actually running these systems at Facebook, Google, and others.

Nationwide, data centers used about 76 billion kilowatt-hours in 2010, or roughly 2 percent of all electricity used in the country that year, based on an analysis by Jonathan G. Koomey, a research fellow at Stanford University who has been studying data center energy use for more than a decade. DatacenterDynamics, a London-based firm, derived similar figures.

The industry has long argued that computerizing business transactions and everyday tasks like banking and reading library books has the net effect of saving energy and resources. But the paper industry, which some predicted would be replaced by the computer age, consumed 67 billion kilowatt-hours from the grid in 2010, according to Census Bureau figures reviewed by the Electric Power Research Institute for The Times.

This represents another contradiction which spans the whole article: there are many mentions of how it’s impossible to accurately measure how much power is being used, and there are just as many specific numbers thrown around for the power being used. Which is it? Do we know, or don’t we?

But the clincher here is that the “industry” which “has long argued that computerizing business transactions and everyday tasks like banking and reading library books has the net effect of saving energy and resources” used, apparently, 76 billion kW-hours in 2010, “but” the paper industry used 67 billion kW-hours in 2010. I’m sorry, what? The paper industry? Are we counting Post-Its here? Kitchen paper towels? And assuming these figures are accurate, in what universe can we compare the power used by compute devices as a whole with the power used by paper-making companies and not only pretend that there’s any equivalence, but also imply that the computer industry has failed because the paper industry is still huge? And that’s the implication created by the paragraph, no question about it.
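Part of what makes these figures hard to compare is that the article mixes units: watts for Google and Facebook, kilowatt-hours per year for the national totals. The following is only a rough conversion sketch using the numbers quoted above. It assumes (and this is an assumption, not something the article states) that the quoted wattages are average draws sustained all year, and it ignores that the “recent disclosures” may not refer to 2010.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def annual_kwh(average_watts):
    """Energy used in a year, in kWh, by a constant draw of `average_watts`."""
    return average_watts * HOURS_PER_YEAR / 1000.0


# Figures as quoted in the article.
NATIONAL_DATA_CENTER_KWH = 76e9   # "about 76 billion kilowatt-hours in 2010"
DATA_CENTER_SHARE = 0.02          # "roughly 2 percent of all electricity"
PAPER_INDUSTRY_KWH = 67e9         # "67 billion kilowatt-hours"
GOOGLE_WATTS = 300e6              # "nearly 300 million watts"
FACEBOOK_WATTS = 60e6             # "about 60 million watts"

# National total implied by the two data center figures.
implied_national_total = NATIONAL_DATA_CENTER_KWH / DATA_CENTER_SHARE

for name, watts in [("Google", GOOGLE_WATTS), ("Facebook", FACEBOOK_WATTS)]:
    kwh = annual_kwh(watts)
    share = kwh / NATIONAL_DATA_CENTER_KWH
    print(f"{name}: {kwh / 1e9:.2f} billion kWh/year, "
          f"{share:.1%} of the quoted data center total")

print(f"implied national total: {implied_national_total / 1e9:,.0f} billion kWh")
print(f"paper industry share of that: "
      f"{PAPER_INDUSTRY_KWH / implied_national_total:.2%}")
```

Under those assumptions, the conversion at least puts everything in the same unit, which the article itself never does.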

The real issue: why?

So as I said at the beginning, the article goes off in what I can only characterize as sensationalistic attacks on data center technology while avoiding the real question: why?

Why do data centers consume so much power? Why are there multiple redundancies? Why is it that utilization is not 100%?

Power consumption

First — data centers consume a lot of power for the simple reason that we’re doing a lot with them. The article touches upon this very briefly. Massive amounts of data are flowing through these systems, and not just by consumers. The data has to be stored, processed, and transmitted, which requires extremely large footprints and therefore huge power consumption. It’s simple physics.

Contrary to what the article states, data centers have undergone drastic evolution in the last ten years, and continue to do so. There’s an incredible amount of work being done to make data centers and infrastructure work better, be more efficient, cheaper, faster, more reliable, you name it. However this leads to some inefficiency. You can’t replace everything at once. New systems and approaches have to be tested and re-tested, then deployed incrementally. There’s no silver bullet.

There’s also the issue of complexity, evolution of requirements, and differences of usage, all of which also lead to inefficiency, which I’ll come back to in the end.

Redundancies

The article strongly implies, more than once, that unnecessary redundancies, fear of failure, and so forth, are one of the key reasons for inefficiency. This is, as I’ve said above, completely untrue. The redundancies that web services have today are not excessive. They are the best way each company has found to solve the challenges they have faced and continue to face. No doubt if you took any one system within any one company and did a deep analysis you could find elements to optimize, but that doesn’t mean that the solutions are feasible. When you deal with complex systems it’s often the case that unintended consequences are lurking at every turn, and the obvious way to avoid a catastrophic failure is to have backup systems. Airplanes have multiple redundant systems exactly for this reason.

And before you say that comparing an airplane with web services is a bad analogy, imagine the type of disruption you’d have, worldwide, if Google was down for a day. Or if hundreds of millions of people couldn’t access their email for a day. The planet freaks out when Gmail or Twitter is down even for a couple of hours! Even services that are less obviously critical, like Facebook or Twitter, would generate huge disruptions if they disappeared. Dictators that are constantly trying to shut down access to these services know why: these services, frivolous as they sometimes appear, are part of the fabric of society now and crucial in how it functions and how it communicates. That aside, there’s of course the minor matter that they are run by for-profit companies, and if the service isn’t running they aren’t making any money.

Speaking of money, that’s an important point lost in the article. People don’t just, as a rule, throw tens of millions of dollars at a problem when they can avoid it. You can count on the fact that if there was a way to provide the same reliability with significantly less money, they would do it. A natural thing to do would have been to go to the CFO of any of these companies and ask: “look, I just uncovered this massive inefficiency and waste, why are you wasting money like this?” But that would go against the narrative that these companies are just gripped by fear of change and either they don’t care that they are burning through hundreds of millions of dollars for no reason, or they are so stupid that they don’t even realize it.

Over-capacity

Third, data center utilization is not 100%. This is true. But back to airplanes, it’s also true that on a typical flight there are empty seats, because you need overcapacity. Similarly, no one runs a large web service without at least 25% of spare capacity, and in some cases you need more. Why? Many reasons, but, first, there are spikes. Usage of services on the web (whether directly through a website or through cellphone apps, or even backend services) is often hard to predict, and in many cases even if predictable it is incredibly variable. There are spikes related to events (e.g. the Olympics) and micro-spikes within those spikes that push the system even further. To use a recent example, everyone talked about how at the conventions a few weeks ago Mitt Romney’s acceptance speech generated something like 15,000 tweets/second while Obama’s speech peaked at around 45,000 tweets/second. What no one is asking is how it is possible that the same system that handled 15,000 tweets/second could also handle 45,000 only a few days later. Twitter didn’t just triple its capacity for that speech only to tear it down the next day, right? The answer is simple: overcapacity, planned and deployed well in advance. And if it didn’t have the capacity ready, what would have happened? Would we have seen an article congratulating Twitter for saving power and not running a capacity surplus? Or would the web have exploded with “Twitter goes down during convention” commentary? It’s not hard to guess.
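A back-of-the-envelope sketch of why spikes force low utilization, using the tweet rates mentioned above. The assumptions are mine: that the 25% of spare capacity must still be there at the larger peak, and that tweets/second is the only dimension of load (it obviously isn’t). The function names are illustrative, not anyone’s real capacity-planning tool.

```python
def required_capacity(peak_load, spare_fraction=0.25):
    """Capacity needed so that `spare_fraction` of it is still idle at peak_load."""
    if not 0.0 <= spare_fraction < 1.0:
        raise ValueError("spare_fraction must be in [0, 1)")
    return peak_load / (1.0 - spare_fraction)


def utilization(load, capacity):
    """Fraction of capacity in use at a given load."""
    return load / capacity


# Figures from the post, in tweets/second.
SMALLER_PEAK = 15_000
LARGER_PEAK = 45_000

capacity = required_capacity(LARGER_PEAK)
print(f"capacity to provision: {capacity:,.0f} tweets/second")
print(f"utilization at larger peak:  {utilization(LARGER_PEAK, capacity):.0%}")
print(f"utilization at smaller peak: {utilization(SMALLER_PEAK, capacity):.0%}")
```

Under these assumptions you provision for 60,000 tweets/second, run at 75% during the larger speech, and sit at 25% during the smaller one. Everyday traffic would be lower still, so a low average utilization is the expected result of being ready for the peak, not a sign of waste.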

Another cause for over-provisioning is quite simply fast growth. These systems cost millions of dollars to buy, deploy, test and launch. This is not the kind of thing you can do every day. So you plan the best you can and deploy in tranches, to make sure that you have enough for the next step in growth, which means that at any point in time you are running with more capacity than you need, simply because even though it’s more capacity than you need today it’s going to be less than what you need tomorrow.
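The effect of buying capacity in tranches can be sketched with a toy simulation. All parameters here (starting demand, growth rate, tranche size) are invented, and the model is deliberately naive: it adds capacity only when demand would otherwise exceed it, whereas real deployments have to be planned well ahead of that point, which makes the gap larger.

```python
import math


def simulate_tranches(initial_demand, monthly_growth, tranche_size, months):
    """Return one (demand, capacity, utilization) tuple per month.

    Capacity only grows in whole tranches of `tranche_size`, and only when
    the current capacity no longer covers demand.
    """
    rows = []
    capacity = 0.0
    demand = float(initial_demand)
    for _ in range(months):
        if demand > capacity:
            tranches_needed = math.ceil((demand - capacity) / tranche_size)
            capacity += tranches_needed * tranche_size
        rows.append((demand, capacity, demand / capacity))
        demand *= 1.0 + monthly_growth
    return rows


# Hypothetical numbers: 10% monthly growth, capacity bought 100 units at a time.
rows = simulate_tranches(initial_demand=100, monthly_growth=0.10,
                         tranche_size=100, months=12)
for month, (demand, capacity, used) in enumerate(rows):
    print(f"month {month:2d}: demand {demand:6.1f}  capacity {capacity:6.1f}  "
          f"utilization {used:.0%}")

average = sum(used for _, _, used in rows) / len(rows)
print(f"average utilization: {average:.0%}")
```

Right after each tranche lands, utilization drops, then climbs as demand catches up. Averaged over time it sits below 100% by construction, which is the “more than you need today, less than you need tomorrow” situation described above.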

There are also bugs. Code that is deployed sometimes isn’t hyper-efficient or doesn’t work exactly as intended (what? shocking!). When that happens, having extra capacity helps prevent meltdowns.

In the process of deploying you need to test. Testing requires you to create duplicates of a lot of infrastructure so that you can verify that the site won’t just stop running when you release a new feature. Testing environments are far smaller than production environments, but they can still be sizable.

Another huge area that requires fluctuating (and yet, ever-increasing) capacity is analytics. You need to make sense of all of the data, be able to know when to run ads, and figure out where the bottlenecks are in the system so you can optimize it. That’s right, before you can optimize any system you need to first make it bigger by creating an analytics infrastructure so that you can figure out how to make it smaller.

Then there’s attacks on infrastructure that have to be survived, since you can’t prevent all of them. The article makes absolutely no mention of this, but before you can detect that an attack is happening, you have to be able to withstand it. So any half-decent web infrastructure has to have the ability to handle an attack before it can be neutralized.

Being ready for spikes and growth, testing and deployment, data analysis, etc, all of this requires overcapacity.

If the article was talking about the human immune system, it would have said something like “look at all of those white cells in the body, doing nothing most of the time, what a waste.” But the truth is that they’re there for a reason.

The common thread: complexity

One final point I mentioned I’d get to a couple of times is complexity. It’s the common thread among all of these reasons, and it doesn’t make for nice soundbites. Starting with the issue of load profiles that I touched on at the beginning, all the way to the problem created by constant (and sometimes extremely fast) growth at the end, we are facing challenges without precedent, and the solutions are often imperfect.

On top of that, requirements change quickly, today’s popular feature is tomorrow’s unused piece of code, and even within what’s popular you can have usage spikes that are impossible to plan for and that therefore you have to solve on the fly, leading to less-than-perfect solutions.

There are no “well known solutions” because each problem is unique. Even within the same domain (say, social networks) you have a multitude of scaling challenges. Scaling profile pages is drastically different than scaling a page that contains a forum. Even for what superficially appears to be the same challenge (e.g. profile pages) each company has different features and different approaches, which means that, say, Google’s solution for scaling Google+ profiles has very little in common with Facebook’s solution for scaling their profiles. Even when the functionality is similar, there are a multitude of factors, such as the business model, that drive what parameters you need to scale for each case. There’s simply no one-size-fits-all solution.

The people working to build our digital infrastructure are extremely talented, work extremely hard, and are facing problems that no one has faced before. This is not an excuse, it’s just a measure of the challenge, which we take on gladly. Sure, it’s not perfect, but it’s what we can do. It’s humans, flaws and all, running these systems. The alarmist and sensationalist tone in the New York Times article, coupled with the insinuation that the solution exists but no one wants to use it, is doing everyone a disservice. Solving these challenges requires continued work, incremental improvements, and a lot of focus.

Or, as Ben Horowitz quoted (in one of my favorite quotes of all time) Bill Turpin as saying during their days at Netscape: “There is no silver bullet that’s going to fix that. No, we are going to have to use a lot of lead bullets.”

Amen, brother.

Part 1 of a series (Part 2, Part 3)
