
Prevention of Online Crashes Is No Easy Fix

By JOSEPH MENN, TIMES STAFF WRITER

Users have heard it all before.

Online auctioneer EBay promises again and again that it’s going to address the system failures that have resulted in numerous blackouts, including one in June that lasted 22 hours.

EBay officials recently said that “in the next few weeks” the company should have a system up and running that will limit downtime to about an hour--most of the time.

Yet for those who make their living on fast-moving businesses, such as EBay, E-Trade or Schwab, a crash that lasts even an hour can be catastrophic. The confusion and frustration of an outage could put a corporate auction customer out of business, and a plunging stock could bankrupt an online trader who can’t trade.


With the rapid advances in technology and billions of dollars at their disposal, why can’t companies keep their systems from crashing?

For the most part, Web crashes are the result of human error, flawed predictions and the frantic pace of trying to keep up with online competitors.

“You can have the best technology on the planet, but half of the failures we analyze are human error,” said Jeremy Burton, a vice president at Oracle, whose databases are used by many Internet companies.

Crashes: No End in Sight

Although broadband and other, better ways of handling massive computer traffic are emerging, companies also are using more untested services that strain the technology. That means that crashes may become even more frequent in the years ahead.

“We used to have six months of development on a product and three months of testing. We don’t live that way any more,” said Debra Chrapraty, until recently chief information officer at E-Trade. “In Internet time, people get sloppy.”

E-Trade and its electronic rivals sometimes crash because of human haste, Chrapraty said. A software upgrade, for instance, might be installed on a company’s computer system while it still has bugs.


“There are so many people in the gold rush to get their applications online first,” said Mark Bauhaus, who heads a division of Sun Microsystems that consults with Internet companies.

And analysts expect the holiday season will bring a rush of new Web sites--and crashes.

Even Amazon.com, considered among the most reliable major electronic merchants, had three outages last month, including one blackout that lasted half an hour.

“It’s a large and complex system, and every once in a while it will hiccup,” Amazon spokesman Bill Curry said. He declined to identify what caused the latest crash beyond saying that it “was related to scheduled maintenance.”

The most common cause of failure is simple: underestimating demand.

Three years ago, leading Internet service provider America Online introduced flat monthly rates for Web surfing and was swamped with people dialing in.

After angering millions of customers with constant busy signals, AOL was sued by state attorneys general and was ultimately forced to offer massive refunds and to upgrade its systems to handle the greater demand.

In February, many computer users couldn’t access a heavily promoted Webcast of a fashion show by lingerie retailer Victoria’s Secret. The company had advertised the Webcast in commercials during the Super Bowl, yet had anticipated fewer than a third of the unprecedented 1.5 million viewers who were able to log in.


A lack of bandwidth and server capacity was also to blame. “When we said we needed the capacity to handle an order of magnitude more [volume] than the previous biggest-ever events, nobody really listened,” said Tim Plzak, director of advanced technologies at Limited, the parent retailing company.

The company is making no promises about its planned second Webcast this May.

In October, when Encyclopaedia Britannica announced it would put content from its 32 volumes online for free, it badly misjudged how much interest the move would generate.

The Chicago-based company had planned for several million visitors to its site the first day.

“It went off the charts,” Senior Vice President Kent Devereaux said. The total number of visits climbed above 10 million, slowing the site to a crawl for the surfers who could reach it, before Britannica pulled the plug. It was two weeks before the company’s pages came back up.

When System Design Is Inadequate

A more technical problem has to do with where on computer networks most information is stored. Too often, it’s clustered in the middle, away from the users. That makes for traffic jams.

“If Hollywood worked the way the Internet works, everyone would have to fly to Hollywood to see a movie,” said Kevin Brown, a spokesman for e-commerce contractor Inktomi Corp.
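
The remedy Inktomi and others sell is caching: keeping copies of popular pages close to the users who want them. For the technically curious, the idea fits in a few lines of Python (a minimal sketch with invented names, not Inktomi’s actual software):

    import time

    class EdgeCache:
        """Toy edge cache: serve popular pages from a nearby copy
        instead of making every request travel to a central server."""

        def __init__(self, fetch_from_origin, ttl_seconds=60):
            self.fetch_from_origin = fetch_from_origin  # the slow trip "to Hollywood"
            self.ttl = ttl_seconds                      # how long a copy stays fresh
            self.store = {}                             # url -> (page, fetched_at)

        def get(self, url):
            entry = self.store.get(url)
            if entry and time.time() - entry[1] < self.ttl:
                return entry[0]                         # fast: the local copy
            page = self.fetch_from_origin(url)          # slow: fetch from the center
            self.store[url] = (page, time.time())
            return page

    cache = EdgeCache(fetch_from_origin=lambda url: f"<page for {url}>")
    cache.get("/listings")   # the first request travels to the origin
    cache.get("/listings")   # the repeat is answered from the nearby copy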


One-time flops frustrate users, but they aren’t likely to be fatal for the companies. After all, battle-scarred AOL is still the largest Internet service provider. More devastating for Net companies are systems that aren’t designed adequately in the beginning.

EBay, which was incorporated just three years ago and now processes more than 650 auction bids per minute, is a perfect example.

Consumer searches and bids go from millions of personal computers, through the individuals’ Internet service providers, to Cisco routers and 200 Windows NT servers at EBay, and then to one of several Starfire mainframes from Sun Microsystems.

Records are kept on Oracle database software. Automatic e-mails to bidders and sellers complicate the matter further.
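
Laid out schematically, each bid hops across a chain of independent tiers, and a failure at any hop stalls the auction. A rough Python sketch of that hand-off (every function here is an invented stub, not EBay’s code):

    def route_through_isps(bid):      # consumer's ISP, then EBay's providers
        return bid

    def pick_front_line_server(bid):  # one of about 200 Windows NT servers
        return bid

    def post_to_mainframe(bid):       # one of the Sun Starfire machines
        return bid

    def record_in_database(bid):      # Oracle database software
        return bid

    def send_emails(bid):             # automatic notes to bidders and sellers
        print(f"e-mail sent for bid {bid!r}")

    def handle_bid(bid):
        # A failure at any tier in this chain stops the bid cold.
        for tier in (route_through_isps, pick_front_line_server,
                     post_to_mainframe, record_in_database):
            bid = tier(bid)
        send_emails(bid)

    handle_bid({"item": "antique clock", "amount": 26.50})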

EBay’s chief scientist Michael Wilson designed the system architecture when the company was tiny, replacing the computers that were in founder Pierre Omidyar’s house.

It hasn’t changed in a fundamental way since, Wilson said.

It’s the same situation at many firms that grow too fast, experts said.

“They don’t plan for success. There’s not enough blocking and tackling,” Sun’s Bauhaus said. “Sometimes it’s like patching the Titanic: It’s better to take the steel out of it and build a new ship.”


EBay has done many things right, starting with eliminating many “single points of failure,” spots that, if they crash, knock out the whole system.

EBay now uses many Internet service providers, for example, greatly reducing the odds of a failure at that point of the system. “If any big ones go down, traffic automatically rolls over,” Wilson said.
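
That rollover can be pictured as a health check that quietly routes around a dead provider. A toy sketch with made-up provider names (in practice the switching happens in routing hardware, not in application code like this):

    import random

    PROVIDERS = ["isp-a", "isp-b", "isp-c"]   # hypothetical names

    def is_healthy(provider):
        # Stand-in for a real check such as a ping or a routing-session probe.
        return random.random() > 0.1          # each provider fails 10% of checks

    def send_traffic(packet):
        # No single point of failure: try each provider until one answers.
        for provider in PROVIDERS:
            if is_healthy(provider):
                return f"{packet} sent via {provider}"
        raise RuntimeError("all providers down")

    print(send_traffic("bid for item 123"))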

And the company is working on a backup system to duplicate all the auction data more rapidly in case one database fails.

But sometimes even the basics can be a problem.

Three years ago, an outage at Stanford University knocked out BBN Planet, a major routing hub for Internet traffic, and workers at hundreds of companies, including Sun, Apple Computer and Hewlett-Packard, lost Internet access for as long as 24 hours.

The apparent culprits: rats that had worked their way into a 12,000-volt switch and were later found fried to a crisp.

That sort of crash also could have been avoided through a commitment of money and time.

Not every company, however, has the money, or is willing to take the time. Frequently it’s a matter of priorities, not the limits of technology.


“There are techniques that are capable of producing systems that are more reliable than we have now,” said Jim McInerney, executive vice president of Exodus Communications, which runs server facilities for e-commerce companies.

Exodus runs server facilities with their own backup electrical power and security guards, and the company has contingency plans for rare problems such as floods. That’s so “you can at least eliminate the facilities and the network as a fundamental issue you have to worry about,” McInerney said.

‘Challenge of Replication’

Net veterans believe that competition to retain customers will eventually force e-commerce sites to make sure they are always available. For now, though, there are some thorny problems.

For example, a lot of EBay’s trouble, and that of the stock-trading sites, is that core information--such as the highest pending bid for an item--is so time-sensitive that it in effect can’t be in two places at once.

“The challenge of replication is something we’re going to be addressing in the next 12 to 18 months,” Wilson said.
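
The problem is easy to state: if two copies of an auction each accepted bids on their own, both could believe they held the highest offer. A toy Python sketch of the single authoritative record that any replication scheme has to preserve (invented names, not EBay’s design):

    import threading

    class Auction:
        """One authoritative copy of the highest bid. Any replica scheme
        must still funnel every bid through a single deciding point."""

        def __init__(self):
            self.highest = 0.0
            self.lock = threading.Lock()   # the "one place at a time" constraint

        def place_bid(self, amount):
            with self.lock:                # serialize: two copies can't both decide
                if amount > self.highest:
                    self.highest = amount
                    return True            # bid accepted
                return False               # already outbid

    auction = Auction()
    print(auction.place_bid(25.00))        # True
    print(auction.place_bid(24.00))        # False: a higher bid already exists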

To EBay users, that sounds familiar: Glitches are an expected part of innovation. But they seem more intractable than ever as the pace of change increases.


Web companies are going to keep racing to be first with new types of businesses. And they all want to be first to conduct those businesses over smart phones, pagers and other new devices.

The combined strain on people, procedures and equipment is going to be too much, the experts believe.

“It’s going to get worse before it gets better,” Bauhaus said.

*

Times staff writer Joseph Menn can be reached at joseph.menn@latimes.com.

(BEGIN TEXT OF INFOBOX / INFOGRAPHIC)

Where It Goes

EBay users rarely think about the bidding process--until the site crashes. Behind the scenes, the online auctioneer has a number of safeguards that rely increasingly on duplicated, or mirrored, technologies in case one piece of machinery or software fails. But the information must still pass through many different companies and types of equipment for everything to work properly.

Bidder at home registers and submits an electronic bid from a personal computer.

The bid travels from the consumer’s Internet service provider, through switches and routers, to the ISP company’s servers.

The bid is sent through the Internet backbone.

The bid travels to one of EBay’s ISPs, most likely Sprint or UUNet, and through pipes to EBay.


The bid passes through EBay’s Cisco switches and routers.

The information reaches one of about 200 front-line Compaq servers running Windows NT. The servers are mirrored, so that if any one fails, the others pick up the slack.

The bid is passed along to one of two Sun Microsystems Starfire servers, named Bull and Bear, which mirror each other.

The bid is added to two information-storage databases running Oracle software, where it is matched with the seller’s information.

The information flow is reversed back out of EBay, into e-mails sent to both the seller and potential buyers who are outbid. Confirmation is also sent to the bidder.

From Bull, the bid amount and other details are sent to another Starfire server, called Anaconda, and recorded on mirrored data storage disks.

[Diagram: buyer submits bid and seller submits information; traffic enters through Sprint or UUNet, passes EBay’s Cisco mirrored switches/routers and many mirrored front-end servers, and reaches the market cluster of the Bull and Bear Starfires; from Bull, data flows to Anaconda (with a planned new Starfire server alongside) and onto data-log mirrored disks; records live on Oracle shared disks (mirrored-disk dual databases).]

EBay is planning to add another Starfire attached to the final data disks, mirroring Anaconda.


Sources: Times staff, EBay
