AT&T Fiasco: Tense Fight With Haywire Technology

TIMES STAFF WRITER

As soon as Jim Nelson glanced at the bank of 75 wide-screen television monitors on Monday at 2:25 p.m. and saw the screaming red lights, he knew the world’s oldest and best-known long-distance communications network had suffered nothing short of a meltdown.

The 48-year-old manager of American Telephone & Telegraph Co.’s network operations facility in Bedminster, N.J., had never seen the screens quite this colorful and animated before. His first wishful thought was that it was the television system--which tracks the pulse of the network--that had broken down, not the heretofore reliable network.

But, within 15 minutes, the worst fears of everyone at AT&T and across America had come true: The nation’s primary communications network, which typically handles more than 80 million calls a day, had collapsed, allowing only half of the routine voice and data traffic to make it through unimpeded.

Over the next nine frenzied and frantic hours, hundreds of engineers across the country scrambled to uncover the problem and correct it. Meanwhile, Nelson recalls, horrified AT&T executives hung around the Bedminster operations command center “like a bunch of expectant fathers outside a maternity ward.”

No wonder. The stakes were enormous. With each hour that went by without the problem being found and corrected, phone users unable to complete or receive long-distance calls lost millions of dollars of business--and patience with AT&T.

Finally, the problem was isolated and solved. But not before it became the worst breakdown ever for the nation’s phone system as well as a monumental embarrassment for AT&T: The telephone giant has been trumpeting reliability as the cornerstone of its marketing efforts to regain business lost in recent years to such upstart rivals as MCI Communications and US Sprint.

“It was an emotional roller coaster for all of us,” recalls Nelson, who helped lead the team of engineers scrambling to fix the network. “At least three times we thought we had the problem figured out. We’d try our fixes, hold our breath and then get upset when they didn’t work. What was never supposed to happen to our system was actually happening right before our eyes.”

Technology experts say that, despite the tremendous inconvenience it imposed, Monday’s long-distance fiasco may actually help Americans understand that we are living in an increasingly complex world where simple, quick solutions don’t apply when technology goes on the fritz.

“We all love magic, but there’s lots of hard work with technology,” said Gerald Weinberg, a technology systems consultant in Lincoln, Neb. “As systems become more complex, it’s harder to diagnose problems and then locate them. Even when they’re found, it’s harder to fix the problems because the systems are more interconnected.”

But last Monday morning there was no reason to believe that this would be the day long-distance technology would be challenged. In fact, because most government offices and schools and some businesses were closed for the Dr. Martin Luther King Jr. birthday holiday, the day figured to be less taxing than usual on the nation’s telephone networks.

Work was progressing Monday on the rapid installation of a new software program to handle traffic among AT&T’s 110 computer centers, which route telephone calls along its thousands of miles of telephone lines nationwide. More than half the centers had been equipped with the new “System 7” signaling program. The remaining sites, where “System 6” was still in use, were being readied for the changeover.

At 2:20 p.m., the 75-set television display on the wall over Nelson’s desk flashed its routine update, just as it does every five minutes of every hour. The display, programmed to show only problems by their location on the continent, was normal.

And then, Nelson recalls, “in a matter of minutes there was a jumble of lines showing that circuits were busy all across the nation.”

Like their customers, AT&T engineers in Bedminster might have been stymied in their efforts to correct the problem if they had not had immediate access to telephone lines entirely separate from the company’s long-distance network.

Using those lines, Nelson and his operation engineers immediately called colleagues at an AT&T complex in Lisle, Ill., as well as researchers at Bell Laboratories’ facilities in Columbus, Ohio, and Indian Hills, Ill. Together, engineers at the three sites monitored the network’s traffic. Within 15 minutes, Nelson says, they had generally concluded that “we had something we had never seen before.”

Figuring out what that something was, however, was a different and far more vexing problem.

The engineers were aided in their search by information generated by the 110 computerized telephone switching centers across the nation. Those reports tell operators how many calls are going through the centers, how long the calls are held before being allowed to proceed and where the calls are being routed.

Because the centers were all reporting, the engineers knew their computers were in working order. After quickly ruling out hardware problems, the engineers turned their attention to the instruction packages directing the operations of the computers: the software.

“The problem here is that you never know how close you are to fixing a software problem,” Nelson says. “It could be five minutes or five hours. It’s not like a broken wire. With software, the problems are either extremely simple or extremely difficult.”

The computer reports also showed that traffic was slowed equally across the nation, indicating that the problem was affecting the entire network, not just a single geographical area. This led engineers to conclude that the problem was probably in the software that was handling the traffic among the 110 switching centers.

Finally, engineers observed that about half the calls were getting through, forcing them to concentrate on the reasons why the remaining half were held up.

Over the next six hours, engineers meeting in teams in Ohio, Illinois and New Jersey studied the data, proposed possible scenarios and offered potential solutions. At least three times, engineers wrote new software code that was loaded into the computers--the Western region computers were selected as the guinea pig test sites--in the hope that the answer had been found.

Finally, engineers looked at the new software that was being installed throughout the network, the “System 7” signaling program that was gradually being phased into the 110 switching centers.

Six hours into the crisis, the engineers had found the source of their problem.

The culprit was the switch in an AT&T facility on Broadway in Manhattan, a room about the size of a two-car garage and filled with gray steel-encased computers capable of handling up to 800,000 telephone calls every hour of every day.

According to Nelson, a seemingly minor problem with a piece of equipment at the facility triggered events that ultimately allowed the language operating the computer to get totally out of sync.

“It was as though the computer that was supposed to be talking German suddenly started speaking in Italian,” he explained.

But the worst of it was that, when the remaining 109 computers started getting the equivalent of Italian from the Manhattan computer, each shut down momentarily, as the computers are programmed to do in the event of such glitches.

Suddenly, the system, hit by so many shut-offs, acted as though it were overloaded, sending out busy signals and overuse messages.
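
For readers who think in code, the chain reaction Nelson describes can be pictured with the small sketch below. It is only an illustration--the toy Switch class, the reset length and the rate of garbled messages are assumptions made for the example, not AT&T’s actual switching software--but it shows how one machine spewing bad status messages can briefly knock its neighbors out of service until the whole network behaves as if it were overloaded.

# Illustrative sketch only -- not AT&T's actual switching software.
# A toy model of the chain reaction described above: one switch keeps sending
# garbled status messages, each peer that receives one briefly takes itself
# out of service, and the network as a whole starts to look overloaded.

import random

NUM_SWITCHES = 110   # the number of switching centers cited in the article
RESET_TICKS = 2      # how long a switch stays down after a bad message (assumed)


class Switch:
    def __init__(self, name):
        self.name = name
        self.down_for = 0          # ticks remaining in a self-protective reset

    def receive(self, message):
        """Handle one status message; return True if a call can be completed."""
        if self.down_for > 0:
            return False                   # still resetting: the call is refused
        if message == "garbled":
            self.down_for = RESET_TICKS    # programmed reaction to such a glitch
            return False
        return True                        # healthy switch, clean message


def simulate(ticks=10):
    # The faulty Manhattan switch is modeled simply as a source of messages,
    # half of which arrive garbled (an assumed rate for the example).
    peers = [Switch(f"switch-{i}") for i in range(1, NUM_SWITCHES)]
    for tick in range(ticks):
        completed = 0
        for sw in peers:
            msg = "garbled" if random.random() < 0.5 else "ok"
            if sw.receive(msg):
                completed += 1
            if sw.down_for > 0:
                sw.down_for -= 1           # count down the reset
        print(f"tick {tick}: {completed} of {len(peers)} calls completed")


if __name__ == "__main__":
    simulate()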

Nelson recalls that engineers believed that they had accurately diagnosed the problem and had fashioned a temporary solution by 10 p.m. (EST) Monday. Now the challenge was to introduce the solution without bringing down the entire network.

Engineers typed into a computer terminal in New Jersey a string of software code that would allow the 110 switching computers to ignore the “Italian chatter,” then slowly “fed” the code into the switching facilities.
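
The general shape of that kind of defensive change can be sketched as follows. Again, this is only an illustration--the function names and the validation check are assumptions, not the code AT&T’s engineers actually entered--but it captures the idea: log a malformed status message and drop it, rather than letting it trigger a reset.

# Illustrative sketch only -- not the actual patch AT&T engineers typed in.
# The defensive idea described above: when a status message from another
# switch arrives malformed, log it and discard it instead of shutting down.

def handle_status_message(message: str) -> bool:
    """Return True if the message should be processed normally."""
    if is_malformed(message):
        log(f"ignoring malformed status message: {message!r}")
        return False               # drop the message; do NOT trigger a reset
    return True

def is_malformed(message: str) -> bool:
    # Stand-in validation check; a real switch would verify the signaling format.
    return not message.startswith("STATUS")

def log(text: str) -> None:
    print(text)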

By 11 or 11:30 p.m. the code had been installed on every computer and the network was running as well as it had nine hours earlier. Still, the network was far from secure.

Over the next 18 hours, engineers--most working with little or no sleep--attempted to recreate the scenario they believed had occurred at the Broadway facility in Manhattan. By 4 p.m. Tuesday, they had succeeded, and they were confident that they had identified the exact chain of events that triggered the Italian chatter and, ultimately, the network’s collapse.

Tuesday night, the culprit code--which is actually contained in all of AT&T’s “System 7” software--was removed throughout the network. But because that part of the code, when it operates correctly, allows the network to perform important housekeeping tasks, AT&T engineers plan to write a replacement in the near future, Nelson said.

Its introduction, Nelson said, will be slow and carefully monitored.
