Learning From a Software Glitch at AT&T

Monday’s glitch-infested shutdown of American Telephone & Telegraph’s long-distance network was one of the best things that could have happened. With the shock of a grand mal fit or a heart attack--but without the long-term pain or debilitation--AT&T’s software seizure sends exactly the right message to everyone who uses or designs complex networks. That message?

Don’t take us for granted.

Back in October, this column pointed out that the increasingly software-driven nature of our telecommunications networks made bugs inevitable. “The real question,” that column said, “is whether this network evolution will hatch little bugs that merely annoy or big bugs that can wipe out phone service for millions of people at a time.”

Now we know.

But the real issue isn’t bugs and software; it’s how do we design for and manage complexity? That’s the real challenge; the software is just the medium we use to express it. Unfortunately, complexity is the sworn enemy of reliability. Striking the best balance between them is extraordinarily difficult. Compared to a decade ago, our telephone networks are extraordinarily complex things. We want our phones and phone lines to have call waiting, call forwarding, computer modems, perfect instantaneous transmission, fax capabilities and a dial tone the millisecond we pick up the receiver. We want functionality and reliability. Is all that realistic?

Absolutely, yes. What’s unrealistic is the idiotic belief that you don’t have to make any trade-offs at all. There’s a sign you can see in many craft and repair shops. It goes: “Fast, Good, Cheap--Pick any two.”

The question is: what trade-offs are people prepared to make? AT&T and the phone companies have had a long and excellent tradition of providing high-quality, highly reliable voice and data communications networks. Their network design is optimized around making sure there are no disruptions in service. The catch is that, in the unlikely event that something does go wrong, it’s more difficult to hunt the problem down and solve it quickly. In other words, these networks are not designed to be easily maintained.

“It’s the Titanic effect,” says Gerald M. Weinberg, a Lincoln, Neb.-based software consultant who frequently works with large telecommunications companies. “If you think you’re piloting an unsinkable ship, it’s ‘Damn the icebergs, full speed ahead!’ ”

While Weinberg stresses that he has enormous respect for AT&T’s software tradition and skills, he points out that one of the most important qualities for truly great systems design is humility. With software, the issue isn’t whether there will be bugs--the issue is how do you hunt them down and squash them when they surface?

Monday’s events tell us that we need a more pragmatic design ethic for software--an ethic that stresses ease of maintenance as much as reliability. We can’t trust complex software to always be reliable--but if we can identify and fix an unanticipated software glitch in less than a minute, the network won’t be down for hours. AT&T rightfully prides itself on building a network that is “self-diagnostic” and “self-healing” on many levels. But the reality is that, as the complex interactions of sophisticated software have multiplied, the company’s traditional methods to assure network integrity are no longer good enough.

Monday wasn’t an anomaly, and MCI and US Sprint would be well advised not to snicker to AT&T’s customers. We have entered a time when consumers, corporations and systems designers will take trade-offs seriously instead of treating them like an intellectual exercise. The emerging emphasis on software maintenance is a clear signal that complexity has finally caught up with us.

“There are three levels of software complexity,” Weinberg warns. “There’s the complexity of the job itself, the complexity of the way you do the job and then there’s the complexity of the environment in which you’re doing the job.” Because all three of those dimensions are rising in complexity, says Weinberg, designing for maintainability is the best way to assure that reliability is even possible.

This is particularly true for the local telephone switch network--the computer that handles everything from call waiting and call forwarding to the dial tone. “The first objective of our network is to get it up and running,” says Bruce D. Gesner, Pacific Bell’s executive director for technology introduction and support. “Now it’s time to pay more and more attention to maintainability. We will be focusing more and more on designing systems that allow us to monitor better and allow us to catch problems before they escalate into disasters.”

Indeed, Gesner is overseeing Pacific Bell’s planned introduction of the “out of band” signaling system--the system with the software that AT&T has identified as being responsible for its network trauma. “We are working aggressively to develop maintenance techniques to manage that,” he promises.

While it’s highly unlikely that a software bug lurking in the local loop will kill off your dial tone, the increased levels of functionality and complexity in your local switch make it more likely that software-induced hiccups and burps will make service a little less reliable. On the other hand, potential problems should be identified, captured and destroyed more quickly.

The beauty of Monday’s mess is that we can no longer afford to take the richness and complexity of our networks for granted. When you stop taking something for granted, there’s a tremendous opportunity for generating new ideas, new approaches and unusual innovations. What happened Monday isn’t a sign that something’s rotten in the state of software; it’s a sign that it’s time to treat our complex systems with the respect and care they deserve.
