diego's weblog

there and back again

Monthly Archives: July 2012

the dark knight rises: an epic conclusion

As I wrote eight (!) years ago, Batman is my favorite superhero character. With Batman Begins, The Dark Knight, and finally The Dark Knight Rises we at last have a movie saga worthy of an iconic comic book character, one that is unlikely to be topped any time soon. It is epic, a great conclusion to the best superhero trilogy ever put on the big screen, and if you enjoy movies you should go see this one in a theater, where it can be experienced as it was meant to be.

I’ve said before that Batman can be defined in contrast with his enemies. If the Joker in TDK was pure id, Bane in TDKR is much more calculating ego, even anti-super-ego (even if that mangles Freudian theory to some degree), and it suits the evolution of the overall story well, fusing the enemy of Batman Begins (the League of Shadows) and the unrestrained anarchy the Joker represented in The Dark Knight into one, with more impact than Knightfall, the comic book series that foreshadows some of the movie’s plot.

The movie isn’t perfect. Parts of it feel rushed, certain plot points seem pulled out of thin air, and others can be seen coming a mile away: three areas in which The Dark Knight was superior. There were times when I found myself admiring the scenery rather than being immersed in it, and I would have liked at least an oblique reference to the fate of the Joker (if there was one, I missed it in what is at times fast, mumbled, or distorted dialogue). These minor failings don’t detract too much from the movie in my opinion, and the movie’s climax is superior to that of The Dark Knight.

The Dark Knight Rises is Christopher Nolan at the top of his game, a master of cinema weaving an extremely complex story with great skill (although perhaps not to the level of Inception), and I can’t wait to see what he does next.

ideas are bulletproof

An act of violence like what happened in Colorado is not really something that we can make sense of, as much as we might try. It is sad but true that this was “just” the act of someone who’s clearly mentally unbalanced, with effortless access to assault weapons and riot gear when he should not have been able to purchase anything more deadly than a set of plastic scissors.

If the first part of The Dark Knight Rises (more on the movie in the next post) is to some degree an expression of the idea that one person can’t push back on an ocean wave crashing on the shore, the second part embodies the notion that how we react, what we do, in the face of forces beyond our control matters. Standing up to something matters. The wave eventually recedes.

I admit that I wasn’t completely unconcerned about going to see the movie last night. The lizard brain is hard to completely quiet down. But I went, and so did a lot of other people. And there was something reassuring, small and yet valuable, about that.

I am not, by any means, saying that going to watch a movie represents some kind of a deeply held moral stance or a profound act of strength of character. Not at all. First, it echoes too much of the post-9/11 notion that “shopping is patriotic” (I’m paraphrasing–you know what I’m talking about). Second, I have no doubt some of the people that went did so mindlessly, that is, without specific intent. But I also have no doubt that for a lot of people there was a kernel of fear in their minds, and what matters is they got over it, and went on with their lives. Some people probably didn’t get over it, and didn’t go — and that’s fine too. I’m not talking about individual actions here, but about the reaction of the collective. Empty theaters on Friday night after what happened at midnight on Thursday would have been a bad sign. A sign that we as a group had given up, retreated to some degree in the face of what’s in essence a world that is beyond our control, even if we like to tell ourselves that it isn’t.

So if millions of people going to see a movie in spite of fear isn’t sudden proof of a culture-wide show of courage, it is also true that there is something important in the fact that people did it: the simple but powerful idea that life, down to its most routine and perhaps even frivolous moments, is worth living, not only when we can protect ourselves from every possible danger and somehow live without fear, but precisely in spite of the fact that we can’t.

Adding to something I said at the end of this post many years ago: Ideas are bulletproof — but only if we believe in them.

nexus 7 and the android experience consistency conundrum

“So now the home screen is locked to portrait mode?”

This was one of the first questions in my mind during the first few minutes of using the Nexus 7, which finally arrived yesterday. After a few hours of use, I have to say: I like it. So what’s this about portrait mode then?

I’ve used the original 7-inch Galaxy Tab, running Froyo (Android 2.2) first and then Gingerbread (Android 2.3), which had the UI locked in portrait mode. I own a Galaxy Tab 10.1, which at the moment runs Android 3.x (Honeycomb, apparently to be updated soon to 4.0 Ice Cream Sandwich) and in which the home screen UI is locked in landscape mode. (Note: while the Kindle Fire is an Android device, the fact that Amazon has modified it to the degree that they have makes it unsuitable in my view for inclusion in a list of “Android tablets”.)

Now with the Nexus 7, running Android 4.1 (Jelly Bean), the home screen UI is again in portrait mode. Although we have to use “again” carefully here since the original Galaxy Tab was really just the phone OS/UI installed on a larger slate, and not really designed for tablets.

Why does the home screen orientation matter? The fact that the home screen UI is now locked to portrait mode may seem like a relatively minor thing, and it is, but I think it is representative of a larger issue facing Google with Android in general: they need to decide when something is good enough, and stop making major changes for a while.

In the tablet space Google hasn’t really had a Google-branded flagship device before the Nexus 7, so we could chalk up some inconsistency to that. But Google has released phones under the Nexus brand for a while now, and every iteration has been different. I own and have used a Nexus 1, Nexus S, and Galaxy Nexus, and while these are all great devices, and in my opinion the best Android smartphones for each respective generation, every new device with its corresponding new OS has made significant, and often bewildering, UI changes.

“Primary” UI buttons (i.e., the equivalent of Apple’s home button) have gone from hardware to software. Their number and functions have changed. Defaults have shifted significantly with each release (even when restoring settings from the same Google account). The store has undergone significant changes and rebranding. Jelly Bean’s home screen now greets you by default with a magazine-like interface to highlight content from your “library,” itself a new concept introduced with the Google Play store. Under the covers, APIs have undergone a dramatic (and, overall, welcome) improvement, but every release feels somewhat disconnected from the previous one by making major changes to what apps are supposed to do.

Now, don’t get me wrong: from the UI to its APIs, Android was initially, and for years, inferior to iOS in my opinion (and yes, I’ve developed and released software on both). With Jelly Bean, and the Galaxy Nexus and Nexus 7 hardware platforms, Google finally has something that is at least up to the challenge, so continued iteration has paid off in that regard. Additionally, an under-appreciated factor in making drastic changes is that Android’s market share on tablets has been tiny, which gives them an opportunity to evolve more quickly.

Android would benefit from less fragmentation of both versions and experience, and a faster update cycle. Part of getting there requires Google to finally settle on the major features of the Android experience and evolve more incrementally in the next few releases. iOS has a real advantage in uniformity of the experience (both in terms of hardware and software) across devices: if you know how to use one iOS device, you know how to use them all. This hasn’t been entirely true of Android devices until Jelly Bean and the Google Play Store.

The wild card in all of this is the OEMs. They seem addicted to making unnecessary modifications and customizations that add little value and are actually counterproductive, in that they invalidate, to varying degrees, the knowledge a user may have of Android from other devices. Their incentive, after all, is to make it harder, not easier, to switch to another manufacturer, which is another advantage Apple has.

With the Nexus 7 and Jelly Bean, Google has a chance to establish a dominant device and experience that could have the effect of forcing the OEMs to see the value in consistency, and over time perhaps this can also trickle over to the Android smartphone space, something that will improve the lives of developers and users alike.

Here’s hoping. :)

the wrong metaphor

“The end of television and the death of the Cable TV bundle.”

Such is the title of this article in The Atlantic, published a few days ago. Every time I see one of these articles pop up, I roll my eyes and wonder when we’re going to stop treating markets, technologies, products and services as if they were living things, and therefore as if they could die and disappear from one moment to the next.

Because, the thing is, they don’t.

AOL still has something like three million dial-up subscribers. Three million people paying hundreds of dollars a year! Vinyl is still around. So is radio. Windows, which is quickly becoming less relevant, still sells three hundred million copies a year or more. Books have been around for hundreds of years. Even print newspapers, under relentless pressure from digital media, still sell hundreds of millions of copies every day. And on, and on, and on.

Technologies that have reached mass adoption can’t be “killed” by other technologies that may replace them. Very large markets have a momentum of their own, and even as the generation of people that grew up with them passes away, there is usually some level of replacement as some people in the new generation carry it forward. Even products, which are specific instances of a technology or specific solutions to a certain market need, take a long time to die off, and can really only be “killed” by whoever is making them. Even then, many long-time users will hang on to them for as long as they can.

I think we have trouble seeing this accurately because of the multi-generational timespans involved. Not unlike the trouble we have in reacting to long-term, multi-generational challenges like global warming.

A more accurate way of describing this process would be to say that markets, technologies, and products fade away, rather than “reach their end” or “die”. Some fade away very slowly, others a bit faster. If the technology allows it, there can be tipping points where a transition to another dominant technology happens fairly quickly, perhaps in a few years. One example of this would be Facebook vs. MySpace, where the switching costs (from the perspective of the user) are so low and the self-reinforcing feedback loops so strong that the transition can happen in a few years. Even so, MySpace is still around, and the situation isn’t as clear-cut as we’d like to think, with Facebook “killing” MySpace, because if anyone was primarily responsible for MySpace’s fast decline, it was MySpace itself.

This is important because how we talk about something matters as much as what we think of it. What we have to avoid is the wrong metaphor becoming the basis for an inaccurate mental model.

Ok, I get it. A headline that says “TV will fade away faster” is less catchy, and sounds stranger, than “TV is dead.” Mass media doesn’t do nuance and subtlety well. But even if we can’t expect this kind of headline to disappear, we should keep in mind that this isn’t how the world really works.

diego’s life lessons

Excerpted from the upcoming book: “Diego’s life lessons: 99 tips for survival, fun, and profit in today’s baffling bric-a-brac world.”

#1: Hide in a cupboard

We start the series with perhaps the most important of all lessons: you should spend most of your life hiding in a cupboard.

The ever-growing focus on “safety” in our society, while laudable in its pervasiveness and intrusiveness, doesn’t go far enough.

Life is full of dangers: sharks, tigers, spiders, sharp-edged furniture, volcanoes, you name it. Consider: just being alive guarantees that at some point you’ll be dead! This is unacceptable. In my own scientific analysis, cupboards are the safest place to be. Here are a few reasons:

  • Cupboards are cool and dry, which coincidentally match the conditions required by most dried or canned foodstuffs to be appropriately stored. Good nutrition is important.
  • Tigers mostly confine themselves to wildlife areas (zoos, your backyard).
  • Sharks need water and can’t really travel far without it, so you won’t find them more than a few feet away from the bathtub.
  • Spiders don’t like beer. Logic dictates, therefore, that they avoid the kitchen.
  • Volcanoes exist in remote areas with weird names, like Krakatoa or Eyjafjallajökull, which is clearly not near your cupboard. Unless you live in Krakatoa or Eyjafjallajökull, in which case I suggest that first you move to another country, and then hide in your cupboard.

All in all, cupboards are excellent locations to retreat to, whether you want to avoid watching election results, ride out the apocalypse, or disrupt your phone’s cell signal so you can play Angry Birds in peace.

Note: In case of other dangerous situations (e.g., a highway nearby) I also recommend wearing a helmet and kneepads when in the cupboard, just in case.

#20: Own and use regularly at least one Windows PC

Maybe you switched to Mac a long time ago. Maybe you’re truly enlightened and run your own Ubuntu Beowulf Cluster in your basement. Whatever the case, not using Windows regularly is a crucial mistake. Specifically, a 2- or 3-year-old Windows PC running the latest version of Windows. (This ensures Windows will run, but just barely.)

A near-death experience will give you a new appreciation for life. Skydiving with a broken parachute, swimming with sharks in blood-soaked water, fighting a Kraken: all of these are good options for that. But using a Windows PC for a few minutes will achieve the same result with less than half the risk of death or injury.

And if you’re a real daredevil, having three or four Windows PCs and attempting to network them is guaranteed to get your heart pounding. True, it’s unlikely you will actually succeed at networking them, but the experience is what really counts. Another option for thrill-seekers is to start using the latest version of Office, sight-unseen, when faced with a non-negotiable deadline. If you do this, make sure to turn on Clippy, aka “Office Assistant”. He will be to you what Wilson the volleyball was to Tom Hanks in Cast Away.

Additionally, the original Minesweeper experience requires Windows, and if you haven’t played Minesweeper on a 200×200 grid, you haven’t really lived.

A corollary to this rule is that you should buy a new Windows PC at least once a year. You will engage in the thrilling process of figuring out whether you should get an AMD Phenom or an Intel Core 2, or find out exactly what the difference is between an nVidia GTX 550Ti, 560, 560Ti, 570, 580, 590, 670, 680, 690, or if you want to go retro and get one of the GT line, or GTS line, or the 4xx line, or even decide that what you really want is one of the many fine ATI cards. (Like choosing one of the hundreds of types of cereal in a supermarket aisle, choosing video cards in the PC world is a wonderful experience that is guaranteed to keep you entertained for days.) Once you order it and get it 6 to 8 weeks later, turn it on just to experience the blast that is the instantaneous update process, as gigabytes of mandatory updates download and install. Later (much, much later) peruse all the pre-installed software and offers. Sign up for as many offers as possible, including, if possible, AOL dial-up, and then attempt to cancel them. Spend some time talking to technical support, rebooting the computer, unplugging it and replugging it. When you’re done, return it. No need to specify a reason. The people at the return center already expect it. In the process, you will help the economy by keeping the service industry humming along.

#47: Avoid nuclear detonations

An important rule to follow: being far away from nuclear detonations when they occur is a must if you want to keep on commuting, enjoying non-fat decaf soy chai lattes, and generally breathing. You may be familiar with nuclear explosions from that documentary by James Cameron about killer robots that will take over the earth in the near future (the one he did before going down to the Titanic to find Kate Winslet’s necklace), as well as countless home movies made by the US Army of houses being blown away and generally left a complete mess. The sheer forces of destruction, surface-of-the-sun temperatures, and blinding flash of light (not to mention radiation) are bad enough, but here’s what they don’t usually tell you: nuclear detonations have a side effect called an EMP, which wipes out electrical equipment far beyond the actual blast radius.

That’s right. No TV. No internets, which means no Wikipedia, or videos of animals doing funny things. No phone (though for AT&T iPhones, the inability to make calls will be nothing new). No blender. No ice. No ice! If there’s a measure of how far civilization has come, it’s the unregulated, unlimited flow of ice in the dwellings of common folk. Without ice, you will lose the ability to produce many common cocktails, and you won’t be able to create any ice sculptures. And who wants to live like that? In a cocktail-less world with no ice sculptures? Seriously.

In short: if you see a very large, very bright mushroom cloud in the distance, board the nearest plane that works and get away from it. Preferably not traveling to Krakatoa or Eyjafjallajökull (see rule #1).

#68: Aliens do not come in peace

Less a “life lesson” than a straight-up fact of the universe, this is something that should nevertheless always be kept front and center. When you find yourself (as we’re often wont to do) in a typical Iowa cornfield in the middle of the night, after having run out of gas, and a shiny spaceship lands in front of you, the rule is simple: DO NOT TRUST THE ALIEN.

Here’s a handy guide of how to respond to various first-contact situations:

  • If the alien majestically walks down from his/her/its spaceship, extends their hand/leg/tentacle and says/whispers/grunts “We come in peace”, shoot him/her/it.
  • If the alien is a tiny crab-like thing that wants to attach to your face and has acid for blood, shoot it.
  • If the spaceship looks like a car and the alien looks human, shoot it twice. Especially if they claim not to be an alien. Those are the most dangerous.
  • If the alien has a bizarre mask and dreadlocks, distract it by placing a cardboard cutout of Arnold Schwarzenegger from the movie Commando at your side, then shoot it. Naturally, this requires that you carry said cardboard cutout with you at all times, preferably on the passenger seat for easy access.
  • If the alien is some sort of gelatinous blob that would not be affected by shooting, just run. Gelatinous blobs are never fast.

The one exception: when you find the alien in your backyard shed, and it likes Reese’s Pieces. In this case, attempt to confirm it’s peaceful by verifying the alien is amenable to cross-dressing and wearing Halloween costumes. Then place it in your garage so it can build some intergalactic phone equipment, and start preparing for unnamed government agencies to descend on your property by, for example, heating up the coffee and getting some donuts. It doesn’t hurt to be polite.

skeuomorphic software, invisible hardware

A number of articles in the last few months have argued against the increasingly common use of skeuomorphisms in UI design. A recent one, which is also a good summary of the argument, is “can we please move past Apple’s silly, faux-real UIs?” by Tom Hobbs. A key point these arguments make is that software shouldn’t necessarily try to imitate the physical object(s) it is replacing, since we are both encumbering software with constraints it doesn’t naturally have and missing the opportunity to really leverage the malleability of software interfaces to create entirely new things.

In the case of Apple, though, I think there may be a reason beyond those usually associated with the use of skeuomorphic design, one rooted in a view of their products as a deeply integrated combination of hardware and software.

Before going into that in more detail, let me say that I actually agree with the general case against the overuse of skeuomorphisms. I think that we have not done enough as an industry to explore new ways of creating, presenting, and manipulating information. There’s definite value in retaining well-known characteristics in UIs for common tasks, but the problem is when we simply substitute copying the real-world equivalent for the task of designing a UI. We haven’t scratched the surface of what is possible with highly portable, instant-on, location-aware, context-aware, always-connected, high-resolution, touch-based (or not) hardware, and just copying what came before us is unnecessarily restrictive.

The case of Apple is slightly different, however. They don’t just produce software, they design and produce the whole package. Arguably, a lot of the success of iOS devices hinges precisely on the high level of integration between hardware and software.

So the question is, if we consider the whole package, not just the software, does that change the reasoning behind Apple’s consistent move towards skeuomorphic UIs? I think it does.

Consider the hardware side of the equation. With every new generation of hardware, whether iPhone, iPad, Mac, or even displays, Apple moves closer and closer to the notion of “invisible hardware”. In recent product introductions they’ve frequently touted how, for example, the iPad or the Retina MacBook fades into the background to some degree: it’s just you, and your content. This materializes in many ways, from the introduction of Retina displays to the consistent move towards removing extraneous elements from displays (no product names, no logos, just the bezel and the display).

I’ve written about this before when I discussed the end of the mechanical age. Apple has for years been moving towards devices that disappear from view even as you’re holding them in your hand, making them simpler (externally), even monolithic in their appearance: just slabs of aluminum and glass. Couple this with a skeuomorphic design approach for the software, and what you get is a view of the world where single-purpose objects fade away in favor of those that can essentially morph into the object you need at any one time.

In other words: I think Apple’s overall design direction, implicitly or explicitly, is that of replacing the object rather than just the function.

Today, this can be done with invisible hardware and skeuomorphic software. In the future, barring the zombie apocalypse or some such :-) we could have devices based on nanomachines that in fact physically morph to take on the characteristics of whatever you need to use.

As I said before, I think we should be exploring new user interfaces, letting go of the shackles of UIs created decades or even centuries ago to find new and better ways of interfacing with the vast ocean of data that permeates reality. Apple’s approach in the meantime, however (regardless of my personal preference), strikes me as a valid direction that is not at all a run-of-the-mill overuse of skeuomorphisms, but something deeper: a slow but steady replacement of inert physical objects with ones that are a malleable, seamless analog UI replacement, with a digital heartbeat connected to the datastream at their core.

another theory on why mac pros were not updated

Post-WWDC 2012, many of us in the nerdsphere were disappointed by the lack of a significant Mac Pro update this time around. Soon afterwards many reports surfaced, including some from within the mothership, that new designs are coming next year, perhaps by late summer or early fall. The reasons for the delay aren’t clear. Marco Arment originally speculated on a Build & Analyze podcast that it may have to do with Cinema Retina Displays, then discarded the theory the following week for the (perhaps more likely) reasons outlined by Ars Technica. In a nutshell, this theory says that the delay is related to Xeon die size, and perhaps chipsets: in real-world terms, an inability on Intel’s part to meet Apple’s requirements for heat, power, or feature support, such as Thunderbolt or USB 3.

While this seems like a reasonable explanation I think there is another factor at play: the market, both in terms of size (units sold) and of who participates in that market.

As for market size, it seems to me that if Apple (and/or Intel) really wanted to push this forward quickly they could, and if they’re not doing so, it’s more likely than not in part because the market size or internal metrics don’t make a strong case for it. Apple, more than other manufacturers, is exceptionally disciplined and systematic about how they update their products. They don’t follow Intel’s timeline the way PC manufacturers do, adding new chips and new systems pretty much every time Intel can manufacture them in volume; if anything, sometimes it even seems that Intel follows Apple more closely than the other way around. Apple sets their own timeline, and they do so not based on the components available or the ever-present rumblings of the tech press (“Apple will surely release X now that competitor Z has done it!”) but based on a deep understanding of who is buying and using their products.

Which leads me to the “who participates” part of my theory. Unlike other manufacturers, Apple has a lot of knowledge of who is using their products, not only because purchases are generally tied to an iTunes account, which is tied to the Apple Store, but also because since MobileMe, and now more broadly with iCloud, we typically sign into all the machines we own with one ID. I’m sure a significant percentage of Apple developers sign in with iCloud on all their devices, a percentage at least large enough to draw statistically valid conclusions. Apple can use this to understand update cycles, simultaneous system use, you name it.

I think we’d all agree as well that the people likely to get a Mac Pro will often also have a MacBook Pro, and it seems plausible at least that those who fall into the Mac Pro buyers group update often. So if you’re about to unveil a kickass high-end Retina MacBook Pro, which would, at the beginning at least, be most attractive to developers and nerds, a group that surely overlaps with the group that has also been waiting for a Mac Pro refresh, what do you do? If I were faced with the choice of upgrading a Mac Pro, last refreshed more than two years ago, or a MacBook Pro, of which I likely have a newer model, it would be a no-brainer: Retina can wait. And since there aren’t that many people who could upgrade both (too expensive), that would have a noticeable effect on Retina MacBook Pro sales out of the gate. Ensuring that demand exists for a product, to the degree that you can control it, is a good idea. A situation like the one the Retina MBP finds itself in, where it’s supply-constrained, is clearly desirable. And you can afford to let down people who really want a new Mac Pro, since you know it’s a small market that has nowhere else to go to get a competitive product, whereas leapfrogging everyone in the portable market with a high-end Retina MBP has far-reaching consequences not just for that small market but for your broader position in the portable market as well, and cements your lead in it.

So while we’ll likely never know exactly why the Mac Pro wasn’t updated, I think it’s a fair bet that the Retina MBP release at least played a role, as far as Apple trying to ensure that they were maximizing the potential market demand for it. At least that’s what I’d do if I were them. :)

cargo cult troubleshooting

I recently started listening to Hypercritical (fantastic show, btw). During last week’s show (#75, “Just A Dinosaur”) Dan Benjamin and John Siracusa discuss the problem of corrupted binaries on the App Store that Marco Arment first brought up (the discussion starts around minute 50 of the podcast, I think). In the process of talking about that, they referenced previous download problems from Apple that Dan had, and how the feedback he received was a whole host of measures, including disabling packet-flood detection and port-scan detection on his router, among other things, of the kind I call “cargo-cult troubleshooting”.

This is not, btw, a criticism of Dan or of the people who offered help, but rather an attempt to codify a particular behavior we engage in that is all too common in solving problems with complex systems. I’ve seen this before, I’ve done it before (I think we all have at one point or another), and it seems interesting to figure out why we do it.

Why “cargo-cult troubleshooting”? Wikipedia has a good article on cargo cults. Briefly (and avoiding the religious overtones), we could say that cargo cults attempt to reproduce the observed conditions under which something happened, thinking that it’s those conditions, and not an external factor outside of one’s control, that made it happen.

A walk down troubleshooting lane

To look at why this happens, let’s start with this particular example: we’ll go through the specific problem-solving process and then look at some possible root causes.

Problem

  • Downloads of software updates from Apple repeatedly fail to validate (and therefore to install) on every machine in your house.

Diagnosis

What we know with a high degree of certainty is:

  • The files must be either incomplete (i.e., a broken download) or complete but somehow corrupted, therefore failing validation against their signature.
  • As a first step, you verify (as Dan did) that downloading from a different geographic location produces no errors. Therefore the problem is location-specific. This rules out a widespread Apple problem.
  • Because you also have multiple machines at home, you can verify that it’s not machine-specific. This rules out problems in one machine, or a failing disk drive.
  • Rebooting router, machines, etc., has no effect, so the problem isn’t related to the state of the machines on your end.
  • It’s specific to Apple-signed binaries. Downloads of any other type work fine for you, including other downloads from Apple, such as visiting their website (in essence, a download of HTML, CSS, JS, images, and other data; I am assuming that apple.com and other Apple properties work, and this is an important clue). What’s more, it seems to be at the very least Mac-specific, in that iOS installs work. I am also assuming, probably correctly, that iOS app installs/updates were unaffected, even when using the same network. iOS apps are also signed and distributed by Apple using the same infrastructure, so this is another important clue.
  • Searching online shows many other people, in several different locations, reporting apparently the same problem. The “apparently” will be something we’ll come back to later, but for the moment let’s assume it is the same problem.

The easy part of the diagnosis is over. Before starting to fiddle with all settings everywhere, let’s see how much farther we can go in identifying the cause. We’re left with a few components that could be the root cause:

  • Apple’s servers/process related to Mac downloads,
  • Apple’s CDN (probably Akamai)
  • more broadly, the route between your house and the servers
  • your local network
  • your ISP
  • your router

Start with the possibility that the downloaded binary is complete but corrupted/invalid. We know that TCP sockets, which underlie HTTP connections, have error correction built in: a valid TCP connection will deliver the same data at the receiving end that was sent at the sending end, so the file arriving at your machine (if complete) will be what the server sent (short of an incredibly sophisticated man-in-the-middle attack). Additionally, since we know that other downloads work, in particular other non-signed-binary downloads from Apple, the network route is fine end to end, and so is your ISP, at least as far as complete downloads are concerned.

So if the file is downloaded fully but is still broken, it means that the server is sending a complete but corrupted or incorrectly signed file. This is the first possible cause.
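As an aside, the failing check is easy to picture in code. Here’s a minimal sketch in Python of a download-and-verify step. It’s an illustration, not Apple’s actual mechanism (which uses cryptographic signatures rather than a bare hash), and the URL and digest shown are placeholders, but the failure modes are the same: a download can be incomplete, or it can arrive complete, byte for byte as the server sent it, and still fail verification because the server itself is serving a bad file.

    import hashlib
    import urllib.request

    def download_and_verify(url, expected_sha256, dest):
        """Stream url to dest, hashing as we go; report how verification failed, if it did."""
        digest = hashlib.sha256()
        received = 0
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            expected_len = int(resp.headers.get("Content-Length", -1))
            while True:
                chunk = resp.read(64 * 1024)
                if not chunk:
                    break
                digest.update(chunk)
                out.write(chunk)
                received += len(chunk)
        if expected_len >= 0 and received != expected_len:
            return "incomplete"   # the "broken download" case
        if digest.hexdigest() != expected_sha256:
            return "corrupt"      # complete but invalid: the server sent a bad file
        return "ok"

    # Placeholder values, not a real Apple URL or digest:
    # download_and_verify("https://example.com/update.dmg", "e3b0c442...", "update.dmg")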

Now, as far as the download being broken or incomplete: is it possible that due to bizarre settings or a bug your router is bailing out (or your ISP blocking traffic) after some amount of data, leaving you with an incomplete file? Unlikely, perhaps, but not impossible. The fact that it only happens with Apple binaries makes it even less likely (a bizarre Apple-specific setting in your router, perhaps? ISP rate-limiting?). Similarly, the likelihood that this could be a widespread problem caused by some characteristic of some modem that somehow only affects Apple’s CDN servers is low to nonexistent. However, let’s call this the second possible cause.

The rest of the scenarios involve either Apple’s signing process failing, or one of Akamai’s or Apple’s storage servers having a corrupt image, a bad disk, or software problems, and then serving out invalid binaries to a given location for a span of time. This is the third possible cause.

So we have whittled it down to three possible causes:

  1. Apple or the CDN is serving a complete but corrupt (or incorrectly signed) binary. In the case of an incorrectly signed binary, this can be only Apple’s, and not the CDN’s, problem.
  2. Your ISP or router is consistently interrupting only signed binaries, and only from Apple. (Keep in mind we already ruled out a complete file being corrupted en route, short of an attack taking place: even in the rare case of a bizarre, or even unheard-of, router malfunction, the probability that such a bug would affect only Apple-signed binaries is similar to that of an elephant suddenly levitating due to quantum fluctuations around it.)
  3. Apple or the CDN is serving incomplete (and therefore broken) downloads.

Let’s look just at the second possible cause versus the other two as a unit for a moment. Occam’s razor comes to our aid (paraphrasing): the simplest explanation is most likely the correct one. Is it more likely that some bizarre setting in your router (or at your ISP) is corrupting only a particular type of binary, from a particular company, in a particular location? Or is it more likely that everything works fine (as it does in all other cases) up to the server source, and that it’s Apple or the CDN that is simply serving a broken file?

The latter is more likely, and a “simpler” explanation, although it may seem unlikely for reasons I’ll touch on later (the “I broke it” vs. “it broke” issue). For the moment, suffice it to say that we automatically assume Apple would not let this happen, but if you remove that assumption (which is incorrect; again, more later) then things become clearer.

While rate-limiting by your ISP or router, or some other router configuration related exclusively to Apple’s servers, is possible, if unlikely, it is pretty much impossible that “normal” content would not be affected too.

This is a critical point that I mentioned above: none of the reports mention problems navigating to Apple’s site, or downloading any other type of content from Apple (such as trailers, movies, music, etc.). What’s more, many if not most of these users are likely to have iPhones, iPads, and iPods, all of which also require signed content downloads, served from the same infrastructure, and therefore under the same conditions, as other Apple updates. If all data coming from Apple, including its website, were failing to load, that would be a much simpler (and more fundamental) problem, one that could, for example, involve DNS settings.

This leaves us with the first and third options, specifically pointing at Apple and not merely the CDN component. Why? CDN storage being at fault could be a culprit, but only if this were a rare, random, and quickly fixed situation. Akamai, and all CDNs, have sophisticated infrastructure that will take “bad” machines out of rotation quickly. Apple (which as far as I’m aware uses Akamai for many things) no doubt has that type of infrastructure too.

This leaves us with the most probable cause: a repeatable problem, persisting in a single location over a span of time, in Apple’s signing process or in the file generation/copying that surely follows it. This would point to a bug in custom software on the server side, in which Apple signs binaries and randomly corrupts some of them, ending up with complete files that don’t pass signature checks. Given that the problem seems to be consistently limited to some locations, we could also guess that there’s something about those locations, by themselves or in interaction with the signing or copying process, that is breaking the binaries: perhaps an older part of the infrastructure that hasn’t yet been migrated to new systems, or some difference in the environment (network time issues are common) that creates locally valid but globally invalid signatures.

The result of the analysis says that there’s no need to fiddle with settings or call your ISP, since that won’t solve the problem. You can only wait for Apple to fix it (perhaps report it to them) and in the meantime get the binaries from another location, like Dan is doing.

Is this absolutely the right diagnosis? I don’t know, of course. Based on the data I have so far, this seems reasonable, and I do know that if I were faced with this problem, I would either download the software from the location that works, or just sit quietly and fume (yeah, more likely the first option :)). I wouldn’t waste a minute fiddling with router settings. Maybe, if I were feeling somewhat desperate, I would reboot the broadband modem, hoping that by doing so I might get assigned another IP by the ISP and perhaps, maybe, get assigned to a slightly different geographic location by Apple where things may be working.

In any case, the specifics of this case aren’t what interests me. What interests me is how solutions that are highly unlikely to affect the true root cause of a problem are accepted, and then spread, online and offline.

Where does cargo-cult troubleshooting come from?

Cargo-cult troubleshooting leads to solutions that are closer to “stand on one foot and whistle quietly” than to something that actually addresses the root cause of the problem; that is, they don’t actually fix the problem at all. But if so, how do these things get started, and then spread, in the first place?

As for how they get started, the most likely source is variables outside your control. Let’s look at this example. The update fails repeatedly. You start trying to fix it, and as long as you keep trying things, the likelihood that (if the problem is on Apple’s end) it will be fixed by them increases significantly. So you do thing #785 and suddenly it works! Only you didn’t fix it. Apple did. Because there’s a giant variable (or more accurately, set of variables) that you don’t control on Apple’s side, along with all the infrastructure in between, you can never really know what fixed it, especially if you’re trying things for a long enough period of time (say, 1-2 hours at least). Unless you show it is repeatable, which we almost never do. That is: propose that switching feature X breaks Y. Switch X off. Show that Y now works. Switch X on. Show that Y now doesn’t work. Do this three or four times (see the sketch below). But that’s not what we usually do. We usually just get something working, are happy that the pain is over, and move on.
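Here’s a minimal sketch of that repeatability protocol, assuming (hypothetically) that both the toggle and the symptom can be scripted; set_feature_x() and download_works() are stand-ins for whatever setting and symptom you’re testing, not real APIs:

    import time

    def x_repeatably_breaks_y(set_feature_x, download_works, trials=4, settle=5):
        """True only if Y fails with X on and works with X off, on every trial."""
        for _ in range(trials):
            set_feature_x(True)
            time.sleep(settle)        # give the change time to take effect
            if download_works():      # with X on, Y should break; if not, no causation
                return False
            set_feature_x(False)
            time.sleep(settle)
            if not download_works():  # with X off, Y should work again
                return False
        return True

One pass through that loop is what “it worked after I changed the setting” amounts to; the other three passes are the part we almost always skip.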

There’s an interesting aside to this in terms of why we assume that the problem is on our end first, rather than the other. It’s what I call the “I broke it” vs. “it’s broken” mindset, of which I’ll say more in another post, but which in essence says that with computer systems we tend to look at ourselves, and what is under our control, as the source of the problem, rather than something else. This is changing slowly in some areas, but in a lot of cases, with software in particular, we don’t blame the software (or in this case, the internet service). We blame ourselves. As opposed to nearly everything else, where we don’t blame ourselves. We say “the car broke down,” not “I broke the car.” We say “the fridge isn’t working properly” as opposed to “I wonder what I did to the fridge that it’s no longer working.” And so on. We tend to think of Google, Apple, and pretty much everyone else as black boxes that function all the time, generally ignoring that these are enormously complex systems run by non-superhuman beings on imperfect hardware and software. Mistakes are made. Software has bugs. Operational processes get screwed up. That’s how things are; they do the best they can, but nothing’s perfect.

The propagation of a cargo-cult solution

So that’s perhaps a valid theory for how non-solution solutions get started, but then they have to spread. Wouldn’t the fact that hundreds of people are saying in forums “this works” mean that it does? Not necessarily.

First, other people trying to solve the problem are also affected by variables out of their control, and they may experience similar results when trying multiple things in sequence.

Second, the people involved in first trying to identify the solution (let’s call them “Patient Zero”) are usually nerds. Take me, for example. I may have already been tinkering with my equipment, and perhaps in a rare case or two mucking around with, say, MTU settings or blocking filters leads me to “unbreak” something that I actually broke but don’t remember changing. We tend to forget that most people never look at a router settings console in their entire lives. So then I post my “solution” as something to try, someone tries it, and it works, which seems to confirm what I said. But it only worked because of external variables, or because of… rebooting.

This is the third way in which “solutions” propagate as valid: rebooting. Many if not most of the “solutions” involve rebooting your machine or the router, disconnecting and reconnecting things, or reinstalling OSes or firmware. Rebooting/reinstalling/power-cycling is the utility knife of cargo-cult medicine, and one that in many cases actually works: low memory, dead sockets hanging around for some reason, subtle bugs, you name it; there is still a need to reboot devices. So in another small number of cases, rebooting actually does fix the problem. I, as Nerd Patient Zero, know this, and it was probably the first thing I tried. But this is not true of everyone, and the least technically sophisticated people are the least likely to just start restarting things for no apparent reason, because they don’t know that there’s a possible correlation between how long something has been running and state corruption, or resource misuse that leads to starvation, leaks, etc. There’s a reason tech support starts by asking whether you have rebooted something. They’re not trying to be obnoxious; they just know that this is often enough to solve state-related problems, and a lot of people don’t think of trying it. The fridge, after all, doesn’t have to be rebooted to be happy, and even the original “Windows Experience” (by which I mean not some fancy Microsoft marketing term, but “reboot at least every day, reinstall every 3 months if you want to have a speedy machine”) is not something that normal people remember to do all that often.

The fourth way in which things propagate is through the game of telephone that is Internet forums. A user may think they have a corrupt binary problem when they actually have another problem. Perhaps the download can’t start at all, instead of failing to validate. No matter: “this sounds kind of like what I’m seeing.” Even if “kind of like” is not really a standard that should apply when debugging this type of problem, they don’t know that. They change the setting (and in the process reboot the router, which perhaps was the real problem) and boom, it works! Or they try a number of things in sequence, then Apple fixes the problem on their side, and presto! In flood the reports of success with the cargo-cult solution.

Finally, and this is perhaps a major reason: we share a strong cultural memory from mechanical and electrical devices in which seemingly ridiculous solutions actually worked. For example, the Apple III was infamously so poorly designed that when there were issues, people were in some cases advised to lift the machine an inch or two off the desk and let it fall, which would solve the problem: the action would re-seat the cards, which had come loose. Similarly, in some older TV sets hitting the TV on the side would fix the problem, for similarly “mechanical” reasons, such as loose components.

One of my favorite moments from Armageddon is when they are trying to restart the engines of the shuttle to get off the asteroid, and Andropov, the cosmonaut they picked up at the space station, gets frustrated with the lack of progress, goes down to some kind of engine room where Watts (the co-pilot) is frantically and apparently randomly pushing buttons, shoves her to the side, and, as he shouts “This is how we fix problem in Russian space station!”, starts banging on some pipes with a wrench. This being a typical Michael Bay movie, the solution works and everyone’s happy ever after. With complex software and hardware systems, however, the equivalent of hitting the equipment with a wrench can’t really solve the problem.

We will only, occasionally, just think it does.
