Reverse Zone
 

Reverse Zone, weblog on urban planning, sustainability, and technology.

Martin Laplante

Wed, 22 Aug 2007

The Skype is Falling! The Skype is Falling!

Last week's Skype outage shows the fragility of peer-to-peer networks once they reach a certain size. To be fair, not all peer-to-peer networks. Skype has based its network's invulnerability on a set of horcruxes (what, don't you read?) known as supernodes, which it has hidden among its customers' computers.

Skype does not provide its own servers to perform most of its critical services. Instead, whenever it notices that one of its customers has lots of bandwidth and no pesky firewall, that computer is pressed into service as a critical server on the Skype network. It sets up a little sign on the internet saying it works for Skype and other computers wishing to make Skype calls should connect through it.

As it happens, Windows machines with automatic Windows Update are prominent among the machines that can be secretly promoted to supernode, because of their large bandwidth budget and permissive security. So when Microsoft made a lot of those machines reboot at the same time on August 16-17, Skype eventually went down, unable to find new "volunteer" servers quickly enough. I think that the secrecy and unwitting outsourcing of the supernode role is a throwback to the Skype developers' Kazaa past, when they thought this was a good defense against being shut down by the authorities.

Now wouldn't it be better if the owners of the supernodes were conscious of the fact that their computers were being used that way? They could maybe take some steps to make sure that Skype wasn't caught short every time they turned off their computers, maybe agreeing to give Skype a 60-second heads-up to start moving customers to a different supernode temporarily.
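To make the heads-up idea concrete, here is a rough sketch of what a departing supernode's handoff could look like. Skype's actual protocol is proprietary, so nothing here reflects how it really works; `plan_handoff` and the supernode and client names are invented purely for illustration.

```python
from itertools import cycle


def plan_handoff(departing, clients, supernodes):
    """Spread the clients of a departing supernode over the remaining ones.

    Hypothetical helper: returns {client: new_supernode}, assigning
    clients to the surviving supernodes in round-robin order.
    """
    survivors = [s for s in supernodes if s != departing]
    if not survivors:
        raise RuntimeError("no supernode left to take over")
    targets = cycle(survivors)
    return {client: next(targets) for client in clients}


# Usage: "sn1" announces it is about to shut down.
plan = plan_handoff("sn1", ["alice", "bob", "carol"], ["sn1", "sn2", "sn3"])
```

With 60 seconds of warning, a plan like this could be pushed to the affected clients before the machine actually goes away, instead of leaving them to discover the loss on their own.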

More telling than the network failure, though, was the fact that it took several days for the network to recover, long after the root cause had disappeared. That shows a singular lack of foresight in the design. I've had that happen to my servers before. When running a particularly popular internet service, I had to restart some server components at one point. Users of the service, seeing a delay in response time, of course just hit reload over and over again, queueing a huge number of requests. But while a service is being restored, not all components come back instantly. The database checks itself out, new database connections are set up, errors are logged, and all sorts of components are yelling at each other, "if you could just hang on a few seconds, I'm busy here." The backlog of requests gets bigger and bigger until the whole thing comes crashing down again, not neatly at all, but with each component hitting its own timeouts and filling up its error logs with cryptic and misleading messages.
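One common defense against that kind of runaway backlog is to cap the queue and turn away new requests while the service is still warming up. The sketch below is only an illustration of that idea, not what my servers or Skype's actually did; the `BoundedBacklog` class and its capacity are made up for the example.

```python
from collections import deque


class BoundedBacklog:
    """Request queue that rejects new work once it is full (load shedding)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.rejected = 0  # how many requests were turned away

    def offer(self, request):
        """Queue the request, or refuse it immediately if the backlog is full."""
        if len(self.pending) >= self.capacity:
            self.rejected += 1
            return False
        self.pending.append(request)
        return True

    def take(self):
        """Hand the oldest queued request to a worker, or None if idle."""
        return self.pending.popleft() if self.pending else None


# Usage: five reloads arrive while only two slots are available.
backlog = BoundedBacklog(capacity=2)
accepted = [backlog.offer(n) for n in range(5)]
```

A quick "busy, try again later" costs the server almost nothing, whereas a request that sits in the queue until it times out ties up resources the recovering components need.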

Thus with the Skype network. It had been brought up to full volume over a period of years. It doesn't have a good mechanism for restarting from zero with everyone waiting. This isn't Windows, where reboots are frequent, the system knows in what order things have to be restored, and users know when to quietly drum their fingers. Since it's distributed, you can't go over the web's intercom and say, "Would everyone please stop trying to connect for 5 minutes, OK?"
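Without an intercom, the usual substitute is to have each client back off on its own. Here is a minimal sketch of exponential backoff with random jitter, assuming a hypothetical `reconnect_delay` function with a 1-second base and a 5-minute cap; I have no idea whether Skype's client does anything like this.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Seconds to wait before reconnection attempt number `attempt` (0-based).

    The ceiling doubles with every failed attempt up to `cap`, and the
    actual delay is drawn uniformly below it, so that clients that failed
    at the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Usage: delays a client would wait across ten consecutive failures.
delays = [reconnect_delay(n) for n in range(10)]
```

The randomness matters as much as the doubling: if every client waited exactly the same number of seconds, they would simply stampede back together a little later.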

If Skype is going to become respectable, with its billions from eBay, it's going to have to be more open about relying on its users for its service. Maybe give them a freebie in exchange for a service level agreement. Maybe even provide some of its own superdupernodes on its own hardware. It can't assume that people won't have firewall rules to keep it out. Skype also has to learn to compartmentalize damage to its network.
