EE Times article on domino effect says "threat is increasing", but "don't panic"


From the electronic engineering journal EE Times http://www.eetimes.com/story/industry/systems_and_software_news/OEG19990709S0029

Experts mull potential domino effect of system failures

By Stan Runyon and Craig Matsumoto, EE Times (07/09/99, 4:06 p.m. EDT)

NEW YORK -- Are our systems reliable? Given the pervasive dependence on electronic systems packed with devices too complex to test down to each transistor, it's a reasonable, if provocative, question. Consider the case of the chip that could have brought down the Internet.

It happened at a New York Internet point-of-presence -- a room stuffed with dozens of network routers. One chip burned out on one board; an engineer put the fire out without incident. But the smoke blown from cooling fans in the routers began drifting into the room and curling up toward the smoke alarms.

Because automatic fire-suppression systems can no longer use halon, the room was equipped with sprinklers instead. Had the smoke been sufficient to set off the alarms and trigger the sprinklers, "it would have taken out every box in the building. It would have taken down the entire U.S. Internet," said engineer Hugh Duffy at Failure Analysis Associates Inc., which investigated the mishap.

The intertwining of systems of all sorts calls for consideration of the ripple effect of any given change or failure, Duffy warned. "It used to be that if a board failed, O.K., so your TV didn't work anymore," he said. But increasingly, "you have to walk your way through all the consequences of [your] decisions."

Some experts, including Duffy himself, cite credible evidence that systems are becoming more reliable relative to their complexity. While acknowledging that systems-on-chip represent a quantum leap in design intricacy, they note that fewer blocks are being connected to the outside -- and it is in the interconnections, they argue, that physical problems most often surface.

Failures decline

"The 'terrible truth' is that failure rates are going down, not up," Duffy said. "People got more experienced at making chips, so they are more reliable."

But the world population's increasing reliance on systems -- and the systems' increasing reliance on one another -- breeds vulnerability. "With the rising complexity of global systems such as the Internet and power grids, the threat and impact of failures is increasing," warned Donald A. Norman, a consultant and author of numerous books on design. "We are getting to the point where we will see complex systems problems the likes of which we have never seen before, and we lack the scientific background to understand them."

Indeed, experts say it is becoming increasingly difficult to gauge the reliability of large-scale systems. The Web, for example, defies analysis because it is a hybrid of the traditional circuit-switched telephone network and today's emerging data, optical and cable nets -- a complex system of interrelated systems.

The Asian flu erased all doubt that global economies are interlocked. But beyond economic institutions, technology itself has intertwined the nations of the world in an interdependent web of critical technologies.

So just how fragile is that web? What would it take to "take down" the planet or a particular portion of its critical enterprises?

"Failure is a normal part of any human-made system, a part of life," said Norman. "The human is part of the system. That's not a novel concept, but it's still novel in many product-development cycles.

"I hear it from many EEs: They are working on something that they say is at such a low level that it doesn't impact anyone. As long as [their subsystem] works perfectly, their assumption is OK," Norman said. "But what happens when it fails?"

Norman, a former head of Apple's Advanced Technology Group, sits on the Computer Science and Telecommunications Board, which advises the U.S. government on safety and reliability. The board's aim is to address growing concerns over national security, especially the exposure of electronic systems to failure by accident or tampering.

"We can put out new computers faster than we can develop security for them," Norman cautioned. "The whole system is very susceptible at all levels." Further, "once you adopt an infrastructure, even if it is extremely vulnerable, it is very difficult to change it."

The U.S. government is taking notice of the problem: The Clinton administration wants to invest $485 million in fiscal 2000 to research the breadth of the nation's vulnerability to reliability issues. While much of the funding would go toward examining cyber-terrorism, the government is devoting increased attention to network reliability per se.

One approach is the development of neutral test beds that can be used to test the reliability of network components and systems. An interagency group coordinated by the White House Office of Science and Technology Policy also wants to devote more resources to understanding interdependencies among communications, financial, transportation and other networks.

"We just don't have a good, science-based understanding of these [network] interdependencies," a U.S. official acknowledged.

Designing diversity into hardware and software systems can improve reliability, according to a report published in April by the National Academy of Sciences and edited by Fred Schneider, a professor of computer science at Cornell University and chairman of the National Research Council's Committee on Information Systems Trustworthiness. As in nature, some members of a diverse population will survive an attack. "This principle can also be applied for implementing fault tolerance and certain security properties, two key dimensions of trustworthiness," concludes the report, "Trust in Cyberspace."
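
One minimal way to picture that diversity principle -- sketched here purely as an illustration, not as anything prescribed in "Trust in Cyberspace" -- is N-version voting: run independently built implementations of the same function and accept the majority answer, so a defect in any single implementation cannot decide the outcome on its own. The three toy "versions" and the planted bug below are invented for the example.

# Toy sketch of design diversity for fault tolerance (N-version voting).
# The three "implementations" and their bug are invented for illustration.
from collections import Counter

def version_a(x: int) -> int:
    return abs(x)

def version_b(x: int) -> int:
    return x if x >= 0 else -x

def version_c(x: int) -> int:
    # Deliberately buggy variant: mishandles one particular input.
    return 0 if x == -7 else abs(x)

def diverse_abs(x: int) -> int:
    """Return the majority answer across independently built versions."""
    answers = [version_a(x), version_b(x), version_c(x)]
    value, votes = Counter(answers).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority -- all versions disagree")
    return value

print(diverse_abs(-7))  # 7: the two correct versions outvote the buggy one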

Schneider said another problem is the lack of business incentives needed to build more-reliable components and systems. Citing recently approved year-2000 liability legislation, Schneider predicted that system vendors will seek to minimize their chances of being sued by fielding more-reliable systems.

"We think it's going to take five to 10 years to get the answers," he said.

Another hot spot is the use of commercial, off-the-shelf (COTS) technologies for high-reliability applications. John East, president of Actel Corp. (Sunnyvale, Calif.), noted that about a quarter of Actel's business is in high-reliability products for satellite, aerospace and military applications. Citing recent satellite failures, East said he sees the seeds of a movement away from the COTS approach.

"The world has overreacted in its embrace of COTS and will start moving back in the other direction soon," toward the use of components that have been specially tested for high reliability in critical applications that can support the added cost, East said. "People were beginning to think that quality is free."

But East acknowledged that the increasing complexity of the technology leads to a dependence on increasingly automated processes, if not on off-the-shelf parts. "This business is so arcane," he mused. "I've been in it forever, and there are people working for me now who [perform functions so esoteric that] I don't have a clue what they are doing. I am lucky if I can [even] name it."

Test gets tougher

East suspects that "testing circuits has become harder than designing them these days. We have to take a look at the test flows." And he admitted that "you cannot test everything: Much is done through simulation, and you can only get close."

But even those who analyze failures for a living note that while designing systems-on-chip seems impossibly complex, the finished products may contain fewer physical defects today than in the past. "It really is counter-intuitive," said engineer Richard Blanchard, a colleague of Duffy's at Failure Analysis Associates, a unit of Exponent Inc. "What happens is you make a system 10 times as complex as it used to be, but it might be [only] twice as likely to fail, not 10 times more likely."
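
Blanchard's ratio follows from ordinary series-reliability arithmetic: if a system fails when any one of its components fails, and process maturity drives per-component failure rates down faster than component counts go up, the system-level failure probability grows far more slowly than complexity does. The numbers in the sketch below are assumed purely for illustration; they are not Failure Analysis Associates data.

# Hypothetical illustration of Blanchard's point: treat the system as failing
# if any one of its n independent components fails (series reliability).
# All numbers below are made up for the example, not measurements.

def system_failure_prob(n_components: int, per_component_fail_prob: float) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - per_component_fail_prob) ** n_components

# Older design: 1,000 components, each with an assumed 1-in-10,000 chance
# of failing over some fixed service period.
old = system_failure_prob(1_000, 1e-4)

# Newer design: 10x the complexity, but process maturity cuts the assumed
# per-component failure probability by 5x.
new = system_failure_prob(10_000, 2e-5)

print(f"old system failure probability: {old:.3f}")  # ~0.095
print(f"new system failure probability: {new:.3f}")  # ~0.181, roughly 2x, not 10x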

Blanchard and Duffy are catastrophe seekers; they hunt down the causes of chip and board failures -- not logic or programming quirks, but tangible, physical problems, mainly short-circuits and fires.

They also test prototype hardware for potential failures -- the "fun" part of the job. Taking a visitor on a tour of the lab, Duffy pointed to stations at which computers are short-circuited, overheated or even blowtorched. He specializes in analyzing circuit boards, which often come to his desk with enormous chunks taken out of the corners, almost always the result of a short-circuit.

Boards are not failing at a greater rate, Duffy noted, but their increased complexity poses the potential for more problems.

Today's boards are built in horizontal layers, including one plane that represents ground and another connected to the power supply. Pins atop the board connect to both planes through vertical tunnels known as vias. That means that vias connected to ground will pass through the power plane, or vice versa, depending on the board layout. And the pins for power and ground are invariably right next to one another on a chip, to keep timing as fast as possible. The proximity of power and ground vias opens the door to problems.

"Before, if a component failed and burned through the board, you got a hole in the board. These days, if a component burns through the board, you have a short-circuit inside the board," Duffy said.

It would be ideal to make critical vias wider or spread them out more, but the age of computer-driven design doesn't lightly suffer such solutions, he said. "A computer program's complicated enough without telling them to make it wider. Nobody goes through and says there are some instances where there are more dangers than others."

Fires are easy to diagnose. What's tougher is when a board stops working because of a blocked or inadequate via, Duffy said. Vias can be examined only by taking a cross-section of the board and slicing it in thin layers, tracing the via's vertical path until a problem is found.

Even Duffy readily acknowledges that hardware's track record is improving, thanks to the advent of the system-on-chip and to clean-room advances that keep chips and boards miraculously free of contaminants. But he's not letting down his guard.

In the 1970s, Duffy recalled, statisticians calculated that if all electricity were cut off, 30 percent of the population would be dead within a month. Within a year, 80 percent would be dead.

"We get up in the morning, and we won't make it to the end of the day unless all these systems are up and running. It's truly scary. And that was back in the 1970s  hell, now it's worse," Duffy said. He has no magic formula for coping with the potential catastrophes simmering under all our system complexity.

But he does have some advice: "Realize that disaster is possible, but don't panic at the idea."

-- a (a@a.a), July 15, 1999

Answers

"a", here is the link to:

A Circle of Dominoes article that has been around for a while.

Ray

-- Ray (ray@totacc.com), July 16, 1999.


TOP!

-- R (riversoma@aol.com), July 16, 1999.

And why can't they use Halon fire suppression systems? Why? Because they are a threat to the ozone layer. HAR HAR HAR HAR HAR!!!!!

-- kozak (kozak@formerusaf.guv), July 16, 1999.

You are walking down the street, a law-abiding citizen (thus unarmed). A mugger (non-law abiding, pointing a gun at you) approaches. "Threat is increasing...but don't panic." (Big Brother will help -- one of his officially sanctioned goons will draw your body outline chalk lines.)

-- A (A@AisA.com), July 16, 1999.
