diego's weblog

there and back again

Monthly Archives: October 2012

kindle paperwhite: good device, but beware the glow

For all fellow book nerds out there, we close the trilogy of Kindle reviews for this year with a look at the Kindle Paperwhite, adding to the plain Kindle and Kindle Fire HD reviews.

This device has gotten the most positive reviews we’ve seen this side of an Apple launch. I don’t think I’ve read a single negative review, and most of them are positively glowing with praise. A lot of it is well deserved. The device is light, fast, and the screen is quite good. The addition of light to the screen, which everyone seems bananas about, is also welcome, but there are issues with it that could be a problem depending on your preference (more on that in a bit).

A TOUCH BETTER

Touch response is better than the Kindle Touch's as well. There are enough minor issues with it that it's not transparent as an interface — while reading, it's still too easy to do something you didn't intend to do (e.g. tap twice and skip ahead more than one page, or swipe improperly on the homescreen and end up opening a book instead of browsing) but it doesn't happen so often that it gets in the way. Small annoyance.

Something I do often when reading books is highlight text and –occasionally– add notes for later collection/analysis/etc. Notes are a problem on both Kindles for different reasons (no keyboard on the first, slow-response touch keyboard on the second), but the Paperwhite gets the edge, I think. The Paperwhite is also better than the regular Kindle for selection in most cases (faster, by a mile), with two exceptions: at the end of a paragraph it's harder than it should be to avoid selecting part of the beginning of the next, and once you highlight something the text gets block-highlighted rather than underlined, which not only gets in the way of reading but also results in an ugly flash when the display refreshes as you flip pages. Small annoyances #2 and #3.

Overall, though, during actual long-form reading sessions I'd say it works quite well. Its quirks seem to be the kind you can get used to, rather than the kind you potentially can't stand.

THE GLOW THAT THE GLOWING REVIEWS DIDN’T SPEND MUCH TIME ON

Speaking of things you potentially can't stand, the Paperwhite has a flaw, minor to be sure, but visible: the light at the bottom of the screen generates a weird negative glow, "hotspots," or a kind of blooming effect in the lower-screen area that can be, depending on lighting conditions, brightness, and your own preference, fairly annoying. Now, don't get me wrong — sans light, this is the best e-ink screen I've ever seen, but the light is on by default and is in part a big selling point of the device, so it deserves a bit more attention.

Some of the other reviews mention this either in passing or not at all, with the exception of Engadget where they focused on it (just slightly) beyond a cursory mention.

Pogue over at the NYT:

“At top brightness, it’s much brighter. More usefully, its lighting is far more even than the Nook’s, whose edge-mounted lamps can create subtle “hot spots” at the top and bottom of the page, sometimes spilling out from there. How much unevenness depends on how high you’ve turned up the light. But in the hot spots, the black letters of the text show less contrast.

The Kindle Paperwhite has hot spots, too, but only at the bottom edge, where the four low-power LED bulbs sit. (Amazon says that from there, the light is pumped out across the screen through a flattened fiber optic cable.) In the middle of the page, where the text is, the lighting is perfectly even: no low-contrast text areas.”

The Verge:

“There are some minor discrepancies towards the bottom of the screen (especially at lower light settings), but they weren’t nearly as distracting as what competitors offer.”

Engadget:

“Just in case you’re still unsure, give the Nook a tilt and you’ll see it clearly coming from beneath the bezel. Amazon, on the other hand, has managed to significantly reduce the gap between the bezel and the display. If you look for it, you can see the light source, but unless you peer closely, the light appears to be coming from all sides. Look carefully and you’ll also see spots at the bottom of the display — when on a white page, with the light turned up to full blast. Under those conditions, you might notice some unevenness toward to bottom. On the whole, however, the light distribution is far, far more even than on the GlowLight.”

So it seems clear that the Nook is worse (I haven't tried it), but Engadget was the only one to show clear shots of the differences between them, and even then I don't think their shots clearly show what's going on. Let me add my own to that. Here are three images:

[Images: the Paperwhite screen at 75% brightness, plus two filtered versions of the same photo highlighting the glow at the bottom.]

The first is the screen in a relatively low-light environment at 75% brightness (photos taken with an iPhone 5; click on them to see them at higher res). The other two are the same image with different Photoshop filters applied, to show more clearly what you can perhaps already see in the first image — those black blooming areas at the bottom of the screen, inching upwards.

The effect is slightly more visible with max brightness settings:

What is perhaps most disconcerting is that what's most visible is not the light but the lack of it — the black areas are the parts that aren't as illuminated as the rest before the light fully distributes across the display.

Being used to the previous Kindles, when I first turned it on my immediate reaction was to think I'd gotten a bad unit, especially because the reviews either hadn't put much emphasis on this issue or seemed to dismiss it altogether, but it seems that's just how it is. Maybe it's one of those things that you usually don't notice but, once you do, you can't help but notice.

So the question is — does it get in the way? After reading on it for hours I think it’s fair to say that it fades into the background and you don’t really notice it much, but I still kept seeing it, every once in a while, and when I did it would bother me. I don’t know if over time the annoyance –or the effect– will fade, but I’d definitely recommend you try to see it in a store if you can.

THE REST

Weight-wise, while heavier than the regular Kindle, the Paperwhite strikes a good balance. You can hold it comfortably in one hand for extended periods of time and immerse yourself in whatever you're reading. Speaking of holding it — the material of the bezel is more of a fingerprint magnet than on previous Kindles, for some reason, and I find myself cleaning it more often than I did the others.

The original Touch was OK, but I still ended up using the lower-end Kindle for regular reading. If I can get over the screen issue, the Paperwhite may be the touch e-reader that breaks that cycle. Time will tell.

short answer yes with an if, long answer, no, with a but…

Part 3 of a series (Part 1, Part 2)

HERE WE GO AGAIN

I will look at this from one more angle and then let it rest here for future reference, since pretty much everyone else seems, not surprisingly, to have moved on. With the aside on how we go about discussing this topic (and various other digressions) out of the way in my post last Sunday, I wanted to focus a bit on what is perhaps the center of the argument used in the Times article. At the very least, elaborating on the flaws at that center should get us very close to exposing the feebleness of the rest of the argument's construction.

At the core of the argument is the following paragraph:

Energy efficiency varies widely from company to company. But at the request of The Times, the consulting firm McKinsey & Company analyzed energy use by data centers and found that, on average, they were using only 6 percent to 12 percent of the electricity powering their servers to perform computations. The rest was essentially used to keep servers idling and ready in case of a surge in activity that could slow or crash their operations.

In my response I took issue with this paragraph in two specific areas: 1) that an average is meaningless without more information (e.g. the standard deviation, for starters) and 2) that the measure of utilization they imply, and I say imply because they never make it clear –another flaw–, was one of “performing computations.” I elaborated a bit on the various types of tasks servers may be performing and noted that you couldn’t amalgamate them all into a single value. This is true, but I want to step back a bit into what utilization means and the unmentioned factor that is on the other side of it: efficiency.

WHERE IT BECOMES CLEAR THAT TERMINOLOGY MATTERS (A LOT)

Semantics time: we need to define some terms. Let's say, just for a moment, to simplify the discussion a bit, that we're OK talking about utilization as some kind of aggregate. Let's ignore the specifics of what the servers are doing and further assume that we can express utilization of the "system" as a percentage between 0 and 100: zero meaning the system is doing nothing but its own minimal housekeeping tasks (so zero isn't really zero, but we're simplifying), and 100 meaning the system is fully utilized doing something specifically related to the application at hand. I should start, though, with some terminology housekeeping by defining what "a system" is.

Definition 0: I will use the words system and machine interchangeably to mean a particular piece of hardware or virtual instance, typically a server, running a particular piece of software. Mixing up virtual systems and machines with actual hardware is a bit of a shortcut, but since in practice virtual systems must eventually map to a real one, it’s one that I think we can live with. (Another shortcut lies in “typically a server” since, say, network switches should also be part of the equation.)

Definition 1: Utilization: a percentage between 0 and 100 of system load related to the specific application tasks at hand.

I shudder at the oversimplification, but I’ll get over it. Probably.

Now, related to utilization is efficiency. While utilization can be said to be an objective, measurable concept, efficiency can only really be understood relative to something else, and in the case of a piece of software, relative to previous versions of itself. So, for example, we can't really say anything reasonable about the efficiency of a V1 piece of software except with respect to imagined possible changes in the future. Conversely, the efficiency of V2, V3, etc. will be defined in relation to whatever version or versions preceded them. Now that we have a definition of utilization, though, we can talk about efficiency in terms of it. For example, if V2 uses half the systems that V1 used, then it's twice as efficient (2x). Or V2 could require 20 machines where V1 required 10, in which case V2 is half as efficient.

Definition 2: Efficiency: the change in utilization between two versions of the system.

Once again — for all data center people out there, I’m oversimplifying for the sake of this particular argument.
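To make Definition 2 a little more concrete, here's a tiny sketch in Python. It's my own illustration of the machine-count framing from the example above, nothing more:

```python
def relative_efficiency(v1_machines: int, v2_machines: int) -> float:
    """Efficiency of V2 relative to V1: how many times fewer (or more)
    systems V2 needs to do the same work at comparable utilization."""
    return v1_machines / v2_machines

print(relative_efficiency(10, 5))   # 2.0 -- V2 uses half the systems: twice as efficient
print(relative_efficiency(10, 20))  # 0.5 -- V2 needs twice the systems: half as efficient
```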

In the paragraph quoted from the Times above there's a bit of a jumble of terms. It starts by talking about "energy efficiency" (which, in our definition, would be 100% minus Utilization%) and then says the data centers "were using only 6 percent to 12 percent" of their electricity for computation, which is straight utilization%. I think playing fast and loose with terminology like that gets in the way of really knowing what we're talking about, which is why I've spent some time defining, at least, what I am talking about.

INSERT OBVIOUS TRANSITIONAL SECTION TITLE HERE

Ok, fine, the squirrel in charge of coming up with section titles is on a break and I can’t think of a good one, so let’s just keep going now that we’re armed with these terms. I’ve repeatedly stated that the average utilization, by itself, is meaningless. Allow me to elaborate on one key reason why. Suppose you measure ten systems and get the following utilization values: 10, 10, 10, 10, 10, 90, 90, 90, 90, 90. This gives you an average utilization of 50% — already something to note, since the average clearly isn’t telling the whole story. Assuming for a moment that these values are comparable across systems (a big if, but again: simplify!), the average is really hiding something important: five systems have “bad” utilization (10%) while the other five have “good” utilization (90%).
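If you want to see just how much that average hides, here's a quick sketch in Python, using the made-up numbers from the example above:

```python
import statistics

# Ten hypothetical systems: five at 10% utilization, five at 90%.
utilization = [10, 10, 10, 10, 10, 90, 90, 90, 90, 90]

print(statistics.mean(utilization))    # 50.0 -- the headline number
print(statistics.pstdev(utilization))  # 40.0 -- a huge spread on a 0-100 scale

# The 50% average alone tells you nothing about the fact that this "fleet"
# is really two very different groups of machines.
```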

But wait, why am I using quotes around "good" and "bad"? Because of something the article implies: that more utilization is better. And this is exactly why talking about efficiency matters. Maybe, just maybe, the five systems at 10% are running software that does the same thing more efficiently, and from that perspective our assessment of good and bad can get inverted.

Suppose the five systems at 90% are part of a cluster. Suppose one of the five systems crashes. Suddenly the load that went to that system has to be distributed to all the others, which quickly puts the remaining four at over 100%, and the entire cluster could crash. The load would never get distributed perfectly across systems, among other problems — in reality we'd probably be looking at a cascade of failures as each remaining system crosses the 100% threshold, putting more load on the others, and so forth, ending with the final state where the entire cluster is down and everyone, from the CEO to your users, is screaming bloody murder.

Suddenly we’re looking at those systems at 90% with some suspicion, no? In fact if you keep on the simplification vibe, you could argue that for a five-system cluster the only way to protect it from any one machine going down (assuming a crash at 100% load, also not a given…) is to maintain the average utilization at around 75% or so, which in the event of one system crashing would leave the remaining four at 93.75%. 80% average would mean one system crashing leaves everyone at 100%. So the difference between 75% and 80%, which seems minimal, is the difference between life and death in this scenario.

But I said "can get inverted" because there are other factors at play. Take just one: uptime requirements. Suppose you're somehow OK with the idea that one system crashing takes everything down, and you're willing to trade uptime for cost (i.e., using only five machines). Then you'd be OK, probably. Everyone I know, however, would want to protect against this eventuality.

This scenario isn’t contrived to get the answer to come up the way I want. It is typical, certainly far, far more common than a simple and straightforward DC setup components get swapped instantly, you’re always at maximum efficiency, and every system does the same thing. Reality isn’t like that. Speaking of which…

A (SMALL) DOSE OF REALITY

Let's complicate things a bit further, or rather, make them just slightly more realistic. Let's say that your 90% utilization was measured during a time when you had 100 simultaneous users on average. Tomorrow, though, Pando Daily posts a glowing review of your website and usage doubles. That doesn't sound too far-fetched. Now what? If there's a strong correlation between simultaneous users and load (which is typical), that 90% utilization is suddenly guaranteed to bring you down. You'd need a much lower baseline utilization to be able to handle that spike, and the 90% looks like a bad idea once again.
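Under the same simplification (load proportional to simultaneous users), the spike math is just as stark. A minimal sketch, with illustrative baselines I picked for the example:

```python
def utilization_after_spike(baseline: float, traffic_multiplier: float) -> float:
    """Assumes load scales linearly with simultaneous users (a simplification)."""
    return baseline * traffic_multiplier

print(utilization_after_spike(90, 2))  # 180.0 -- the "well utilized" cluster falls over
print(utilization_after_spike(45, 2))  #  90.0 -- a 45% baseline barely absorbs a 2x spike
print(utilization_after_spike(10, 2))  #  20.0 -- the "inefficient-looking" cluster shrugs it off
```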

Going a bit further, imagine the two clusters of five systems are actually performing the same function, but one of them runs V1 (90%) and the other runs the more efficient version you just deployed (10%). This is something that happens all the time: as you instrument a piece of software running under real-world load, you understand better what leads to that load, you optimize your software, and massive jumps in efficiency are not uncommon. So when the 90% is deemed a problem, the team gets to work, and they come up with a V2 that is around an order of magnitude more efficient.

Which leaves you with two possible versions of the software doing the same work, one running its systems at 10% utilization and the other at 90%. Which one is better? I think everyone would agree that the newer, more efficient version is better, even though deploying it would suddenly make your utilization plummet across the board, which, according to the article, is "bad". Oops.

Hold on, I hear you say. Now that you've got software that is more efficient, why don't you just decommission the extra systems you don't need? Then you'd be using less power overall, and you could cut down, say, from 10 machines at 90% to 5 machines at 20%, which should give you a nice margin for error.

Aha! This surely sounds true, but it's also where the oversimplifications I keep bashing get tricky, because the real world gets in the way. Two interrelated points. First, there aren't that many (if any) tidy switchovers from V1 to V2. In increasing efficiency you may have introduced bugs. To protect against that, you have to test (more machines for that!), and even when testing says it's OK to deploy, you start small — deploy to a few machines only, wait, verify. Deploy a bit more. Wait, verify.
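As a rough sketch of that process (the `deploy` and `healthy` hooks and the stage sizes here are hypothetical placeholders, not any particular tool):

```python
import time

ROLLOUT_STAGES = [0.02, 0.10, 0.50, 1.0]  # fraction of the fleet per stage (made up)

def staged_rollout(fleet, deploy, healthy, soak_seconds=3600):
    """Deploy V2 to a growing fraction of the fleet, pausing to verify health
    after each stage and aborting if anything looks wrong."""
    deployed = 0
    for fraction in ROLLOUT_STAGES:
        target = int(len(fleet) * fraction)
        for machine in fleet[deployed:target]:
            deploy(machine)        # push V2 to this machine
        deployed = target
        time.sleep(soak_seconds)   # wait...
        if not all(healthy(m) for m in fleet[:deployed]):
            raise RuntimeError("rollout aborted: unhealthy machines detected")
```

Each of those stages burns wall-clock time, which happens to be exactly the variable the next section is about.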

THE MISSING VARIABLE: TIME

The process we just described has an important variable that we haven’t looked at so far: time. As in, in the process of doing this, time passes.

Sounds obvious, right? But if this is true, it's also true that as time passes, requirements change. "Requirements" here encapsulates all the factors, external and internal, that go into delivering your service or product. Whatever you're doing, you're not dealing with a static entity but with something that is constantly evolving, both because you're changing it from the inside as you update the software, fix bugs, evolve the architecture, and deploy new hardware, and because the external factors are constantly in flux. Likely you will now have more people using the systems (or, for things that are just APIs, maybe more machines). Or load may have changed because of a feature.

The passage of time is the critical missing element here, and it gets in the way of the ideal scenario in which we allow the system to truly "contract" and use fewer resources in terms of power. By the time you're sure you can decommission those extra machines, it's quite possible that you now have other uses for them, and even if you did find a sliver of time to do it, you may not have the option, since system growth, or feature changes, or whatever, may be telling you that you will need those extra machines in a week or a month, and therefore taking them down would simply be a waste of time. At Ning, for example, our overall hardware footprint remained largely stable in terms of number of machines, and therefore total power consumed, even as we went from 1 MM registered users to 50 MM and beyond. Setting aside the fact that this took an enormous amount of work and constant vigilance, it also meant that the system that could handle 50 MM users with the same amount of hardware as 1 MM was very different from the original. And throughout that process, many machines would oscillate in their degree of utilization, from the low teens to the high 70s. Over a period of a few years you could have taken measurements at different points in time that would make us appear either like geniuses or criminally stupid — if all you looked at was utilization, and if "less utilization is bad" was your only guiding principle.

If we can agree that low or high utilization alone is meaningless, and that your utilization will fluctuate, maybe drastically, as you optimize, we can start to ask more appropriate questions. For example: can we do better in terms of releasing idle capacity as efficiency increases? Absolutely, and the widespread use of virtualization in recent years also means that it’s now far easier to have capacity fluctuate along with load. The APIs for system management popularized by public cloud infrastructure companies (e.g. EC2, Rackspace cloud, etc.) have led in the last couple of years to more and more services where capacity is instantiated on demand according to load, leading to a more efficient use of those virtualized resources.
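As a minimal sketch of what "capacity fluctuating along with load" means under the utilization model from earlier (the thresholds are invented for illustration; real autoscaling policies are considerably more involved):

```python
import math

TARGET_UTILIZATION = 60.0  # leave headroom below the crash threshold
SCALE_UP_AT = 75.0         # add capacity when average load crosses this
SCALE_DOWN_AT = 40.0       # release capacity when average load drops below this
MIN_INSTANCES = 2          # never go below the redundancy floor

def desired_instance_count(current_instances: int, avg_utilization: float) -> int:
    """Very simplified scaling rule: keep average utilization near the target,
    but never drop below the minimum needed to survive a machine failure."""
    if SCALE_DOWN_AT <= avg_utilization <= SCALE_UP_AT:
        return current_instances
    total_load = current_instances * avg_utilization
    needed = math.ceil(total_load / TARGET_UTILIZATION)
    return max(needed, MIN_INSTANCES)

print(desired_instance_count(5, 90))   # 8 -- scale out under heavy load
print(desired_instance_count(10, 20))  # 4 -- release idle capacity
```

Note that the physical machines behind those instances don't vanish when you scale down, which is where the next problem comes in.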

Even there, though, we have a problem, for if EC2 allows you to instantiate and tear down a hundred AMIs without giving it a second thought, it's also necessarily true that there must be an actual hardware footprint doing nothing but waiting around for that to happen. Having people constantly disconnect and reconnect machines is not just infeasible but, in almost every case, will involve as much waste as taking a system down, since in the world we live in the capacity needed to run the ever-increasing complexity of Internet infrastructure keeps going up, which means that whatever you stop using today you're guaranteed to need tomorrow. With 60% or more of the humans on the planet still not online, there's clearly still a lot of growth left. For the biggest services, it's common to have to constantly deploy new capacity just to account for growth.

This leads us to yet another question, perhaps the last one in this particular chain of thought: Should we in general accept less reliability from online services given that it has real environmental impact, and real cost?

My own answer to this would be a clear NO. I don’t think you can have these systems simultaneously be part of the fabric of society (a point I’ve made before) and have them be “partially reliable,” just like there’s no way to be “partially pregnant.” Reliability is intrinsically tied to the usefulness of these services. Perhaps there are ways in which we can bake in more asynchronous behavior in some cases, but when a lot of what systems do is real-time, 24/7/365 and worldwide, this isn’t something we’ll be able to exploit frequently. We have crossed the Rubicon, so to speak, and have to see this through.

THE CONCLUSION, OR, BELATEDLY MAKING SENSE OF THE TITLE OF THIS POST

Utilization in data centers is an important issue, but talking about it bereft of context is not really that useful. In particular, talking about utilization without also talking about efficiency, and about all the parameters that go into it (what kinds of applications are running, what the goals are, what the requirements are, etc.), is going to leave us with nothing but incomplete answers, and here incomplete is way too close to incorrect for comfort.

And this isn't just about the Internet. Is it a valid question to ask about utilization in, say, TV stations? Or any other major source of media, for that matter? Pre-digital music distribution? Sure it is, but it's not something we focus on, because the use of energy in other media is one or two steps removed from what we see, so it's easier to ignore even if it's there all the same. When was the last time you remember a TV station being unable to broadcast? Are we to believe they never had a power failure? No. They have backup systems. Those evil-sounding lead batteries or diesel generators.

Context.

By looking at context, and by simply shifting to the assumption that Internet infrastructure is actually run fairly well given all the requirements and its rapid evolution, we realize that what we really should be wondering about is what makes us build systems this way, rather than assuming they are not built properly. Should we talk about utilization? Or is this really about what drives it, so that utilization is part of the discussion but not the central point?

Channeling Reverend Lovejoy for a moment, we could say, then: “Short answer yes with an if, long answer, no, with a but…”
