AT&T’s Data Crash Raises Some Concerns

Boosters of the Internet and other widely used public data networks found out this month how vulnerable the electronic pipelines are when an AT&T Corp. maintenance blunder sent the information superhighway into gridlock.

Attempting to update software in a high-speed network, AT&T technicians were stung by a software bug on the afternoon of April 13. In hindsight, company officials admitted their procedures were “inadequate.”

The resulting outage lasted a full day and wasn’t completely diagnosed for more than a week. It prevented untold numbers of consumers from completing credit card purchases, halted withdrawals at many Wells Fargo Bank automated teller machines and even caused some cash registers to malfunction at coffee purveyor Starbucks.

Nevertheless, high-speed data networks such as AT&T’s have become an increasingly popular way for banks, airlines, retailers and other businesses to relay customer information and other data to their branch offices.

With rates up to 40% cheaper than leased private lines, the market for these so-called frame-relay networks has nearly doubled to $2.2 billion in the last 18 months.

But the clamor for the convenience and economy of the public data lines is quickly being replaced with new concern over the reliability of the transmission technology, which can relay data at speeds up to 45 megabits per second--more than 500 times faster than a good personal computer modem. The higher-cost private lines have proven less susceptible to software glitches.
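
For readers who want to check that comparison, here is a rough back-of-the-envelope sketch (assuming a 56-kilobit-per-second dial-up modem, the fastest consumer modem of the period, as the baseline):

    # Rough speed comparison: frame relay vs. a dial-up modem.
    # Illustrative assumption: a 56 kbps consumer modem as the baseline.
    frame_relay_bps = 45_000_000   # roughly 45 megabits per second (a T3 line)
    modem_bps = 56_000             # 56 kilobits per second

    ratio = frame_relay_bps / modem_bps
    print(f"Frame relay is roughly {ratio:.0f} times faster than a 56k modem")
    # Prints about 804, comfortably "more than 500 times faster."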

While outages in public data networks are taken in stride by some organizations as a cost of doing business, others fret that AT&T’s frame-relay network outage is symbolic of looser quality standards that have come to typify cutting-edge technology.

In the days of Ma Bell, ordinary telephone service stood as the gold standard of communications service, mostly because the network was designed only to carry telephone calls.

But in today’s more competitive world of data networking, manufacturers are in an almost constant arms race to build complex systems that relay voice, video and data over the same wires at faster speeds.

Manufacturers such as Cisco Systems Inc., which made the errant switch on AT&T’s network, constantly make hardware and software upgrades. A few of those upgrades are plagued by the same kinds of bugs that have infuriated PC users for years.

“They are constantly changing both the hardware and the software,” said Ron Jeffries of Jeffries Research, an Arroyo Grande, Calif.-based consulting firm. “It’s just a fact of life that these networks are getting bigger and more complex. I think it’s telling that two [technology] giants had this outage and it took them more than 20 hours to fix it.”

In the wake of the AT&T outage, experts say the crucial question is whether big business, after moving in droves from private to public data networks, will be spooked by the apparent inability of the world’s biggest communications company to quickly fix a network outage.

“This shows that modern data networks are susceptible to failures no matter who runs them,” said Amir Atai, director of network traffic and performance at Bellcore, a leading telephone industry research and development company. “The important thing is that new software releases have to be tested . . . before they go to the network.”

But that’s easier said than done.

AT&T was tripped up while modifying a program that controlled a network switch. Officials said that while maintenance crews were making the software upgrade, a previously unknown software flaw caused the switch to send out a cascading wave of false messages to the other 145 switches on AT&T’s data network, overloading the system and shutting it down.

“This one particular procedure on one switch, coupled with that software flaw, started a looping of messages and, hence, all of the switches became overloaded,” said Frank Ianna, executive vice president of AT&T’s network and computing services. “We will certainly not do that procedure again on that particular switch.”
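
The failure mode Ianna describes, in which one switch’s false messages are rebroadcast by its peers until every switch is overloaded, can be sketched with a toy simulation. The model below is purely hypothetical (the capacity threshold and message volumes are invented assumptions, not AT&T’s actual switch parameters); it only illustrates how quickly a looping broadcast among a faulty switch and 145 peers saturates the whole network:

    # Toy model of a cascading message loop among network switches.
    # Purely illustrative: the capacity figure and message volumes are
    # invented assumptions, not AT&T's actual switch parameters.
    NUM_SWITCHES = 146          # one faulty switch plus 145 peers, per the article
    CAPACITY = 10_000           # hypothetical messages a switch can absorb

    load = [0] * NUM_SWITCHES   # messages queued at each switch
    pending = {0: 1}            # the faulty switch (index 0) emits one false message

    rounds = 0
    while pending and not all(l > CAPACITY for l in load):
        rounds += 1
        next_pending = {}
        for sender, count in pending.items():
            # Each false message is rebroadcast to every other switch,
            # and every recipient re-propagates it on the next round.
            for receiver in range(NUM_SWITCHES):
                if receiver != sender:
                    load[receiver] += count
                    next_pending[receiver] = next_pending.get(receiver, 0) + count
        pending = next_pending

    print(f"All {NUM_SWITCHES} switches exceeded capacity after {rounds} rounds")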

Nevertheless, predicting where the next glitch may occur is next to impossible. In the April 13 incident, AT&T engineers thought the physical isolation of the switch would localize any problem.

Experts say AT&T’s biggest mistake was performing the delicate operation on a Monday afternoon, during the equivalent of rush hour on the information superhighway.

Experts say the beginning of the week is often a time when a trickle of weekend data traffic escalates into a flood of messages.

“It’s like trying to change a tire on the Brooklyn Bridge during rush hour,” said John Nitzke, a senior analyst at the Cambridge, Mass., consulting firm Forrester Research. “They felt comfortable enough to do it . . . but I think it was ill-advised.”

But Ianna responded that since the switch was not handling customer traffic, there was no reason to suspect that it might trigger problems with active switches.

AT&T officials declined to say how much financial damage the outage caused, but they are not charging customers during the two weeks it has taken the company to diagnose and fix the problem. Analysts estimated that gesture could cost AT&T about $40 million in lost revenue from its $900-million-a-year frame-relay business.

Times staff writer Jube Shiver Jr. can be reached via e-mail at jube.shiver@latimes.com
