Fault tolerance: dominoes or a net?

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

Folks predicting TEOTWAWKI often argue for a domino chain: A crashes B and C, which crash D to G, which crash.... This is very far from the truth (if it were not, the first fault would crash everything!), though interconnectedness is of course a reason for concern.

A recent post on Gary North's forum about fault tolerance got closer to reality, but still claimed that society is less than 10% fault tolerant. This got me thinking (not least because I'm sure WWII caused more than 10% failure!)

A simple example of an interconnected net is a sheet of squared paper. What percentage of the crossovers obliterated at random does it take before you can no longer find a route from anywhere to anywhere else?

It's a lot more than 10% unless you impose a pattern on the blockages. I don't know the answer (and in any case it depends critically on which intersections get obliterated towards the end), but it's quite large: maybe near 50%. (I could look it up, but there's no point being very precise about something this far abstracted. Just drawing it for yourself is more informative).
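For anyone who would rather let a computer do the drawing, here is a rough Python sketch of the squared-paper experiment. It knocks out intersections in random order and reports the fraction removed at the moment no route is left from the left edge to the right edge. The edge-to-edge test is only a stand-in for "anywhere to anywhere", and the grid size, trial count, seed and function names are all arbitrary assumptions made for illustration. For reference, the textbook site-percolation threshold for an infinite square lattice is about 59% of sites intact, i.e. roughly 41% removed, so a small grid should land somewhere near that.

```python
import random
from collections import deque


def spans(open_sites, n):
    """True if the surviving intersections link the left edge to the right edge.

    Moves are allowed only between 4-neighbours (up, down, left, right).
    """
    queue = deque((r, 0) for r in range(n) if (r, 0) in open_sites)
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if c == n - 1:
            return True
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in open_sites and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return False


def critical_removal_fraction(n, rng):
    """Remove intersections in a fixed random order and return the fraction
    removed when the left-to-right route first disappears.

    Removing more sites can never restore a route, so a binary search over
    the number removed is valid for a fixed removal order.
    """
    order = [(r, c) for r in range(n) for c in range(n)]
    rng.shuffle(order)
    lo, hi = 0, n * n  # route exists with lo removed, not with hi removed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if spans(set(order[mid:]), n):
            lo = mid
        else:
            hi = mid
    return hi / (n * n)


def mean_critical_fraction(n, trials, seed):
    """Average the critical removal fraction over several random trials."""
    rng = random.Random(seed)
    return sum(critical_removal_fraction(n, rng) for _ in range(trials)) / trials


# Hypothetical run: a 30x30 sheet, 20 random trials.
print(f"mean fraction removed at breakdown: {mean_critical_fraction(30, 20, 1999):.2f}")
```

This says nothing about patterned (non-random) blockages, which is exactly the case that could bring the figure down sharply.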

This is a possible model of a telecommunications grid, where every exchange is connected to four others (probably an underestimate) and consists of only one switch (possibly also an underestimate). Any telecomms systems engineers like to comment on reality?

As for a whole economy, this is far more complex, not least because the underlying rules aren't even fixed. However, if a net is a more accurate reflection of things than a chain, the fault tolerance is a lot higher than GN's 10%. Anyone know if any econometric (or possibly military) studies of this have been made? Also a net model predicts that the effect of multiple failures will be a reduction of thruput (ie a recession or depression) rather than a complete collapse, up to a fairly large critical figure.
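The "reduced thruput rather than collapse" prediction can be illustrated with the same squared-paper toy. The sketch below removes a given fraction of intersections at random and reports what share of the survivors still sit in the single largest connected cluster. Treating that share as a proxy for throughput is purely an assumption for illustration, as are the grid size, seed and function name; it is in no sense a model of a real economy.

```python
import random
from collections import deque


def largest_cluster_share(n, removed_fraction, rng):
    """Share of surviving intersections that belong to the largest cluster."""
    sites = [(r, c) for r in range(n) for c in range(n)]
    n_removed = int(removed_fraction * len(sites))
    alive = set(rng.sample(sites, len(sites) - n_removed))
    if not alive:
        return 0.0
    seen = set()
    best = 0
    for start in alive:
        if start in seen:
            continue
        # Breadth-first flood fill of one cluster, counting its size.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            r, c = queue.popleft()
            size += 1
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in alive and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best / len(alive)


# Hypothetical sweep on a 40x40 sheet.
rng = random.Random(1999)
for pct in range(0, 80, 10):
    share = largest_cluster_share(40, pct / 100, rng)
    print(f"{pct:2d}% removed -> largest cluster holds {share:.0%} of survivors")
```

On a random-removal grid like this, the largest cluster should keep nearly all survivors at low damage levels and fall apart only once the removed fraction approaches the critical figure.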

Another attempt by myself to get a handle on the big picture, but like Cory I still can't see through the curtain.

-- Nigel Arnot (nra@maxwell.ph.kcl.ac.uk), February 04, 1999

Answers

Rather than thinking that each crash causes the next 1 - 5 to crash, let's separate the dominoes a bit and suggest that some crash the next ones, some miss the next ones and crash alone, and some simply cause the next ones to totter, or become unstable, so that the next gust of wind or someone walking down the hall outside MIGHT cause it to fall.

In this analogy, the tottering dominoes represent the companies which have their efficiency degraded by the crash of another company. This does not include the very real possibility that a degraded efficiency company CAN cause another company (up or down the supply/product stream) to crash.

Also consider that the crash may be either way, and that a tottering domino may cause another to totter, and this may come back up/down the chain and cause a reflex collision and a crash going the other way in the stream. The web becomes a bit more complex, and we start to have to look at fault tolerance in less absolute terms. Which, instead of helping, makes the "calculus" even more difficult, and a sub-optimal outcome (medicine showing) more likely, rather than less.
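A toy sketch of the separated-dominoes idea, for anyone who wants to play with the numbers: each fallen domino either crashes the next one, leaves it tottering (in which case a later "gust" may or may not finish it off), or misses it. The probabilities, the function names and the single one-way line of dominoes are all assumptions made up for illustration; the reflex collisions coming back up the chain are deliberately left out.

```python
import random


def run_chain(n, p_crash, p_totter, p_gust, rng):
    """Simulate one line of n dominoes and return how many end up fallen.

    p_crash  - chance a fallen domino knocks the next one straight over
    p_totter - chance it only leaves the next one unstable
    p_gust   - chance an unstable domino is later pushed over anyway
    """
    fallen = 1        # the first domino is pushed over by hand
    pushing = True    # did the previous domino fall?
    for _ in range(n - 1):
        if not pushing:
            break     # the chain is broken; the rest stay standing
        roll = rng.random()
        if roll < p_crash:
            fallen += 1
        elif roll < p_crash + p_totter:
            if rng.random() < p_gust:
                fallen += 1       # tottered, then fell
            else:
                pushing = False   # tottered but stayed up
        else:
            pushing = False       # missed: the previous one crashed alone
    return fallen


def mean_fallen(n, p_crash, p_totter, p_gust, trials, seed):
    """Average number of fallen dominoes over many simulated chains."""
    rng = random.Random(seed)
    total = sum(run_chain(n, p_crash, p_totter, p_gust, rng) for _ in range(trials))
    return total / trials


# Hypothetical numbers: 100 dominoes, 2000 trials per setting.
for p_crash in (0.5, 0.8, 0.95):
    avg = mean_fallen(100, p_crash, 0.04, 0.5, 2000, 1999)
    print(f"p_crash={p_crash}: about {avg:.1f} of 100 dominoes fall on average")
```

In this one-way version the chance of passing a failure along is q = p_crash + p_totter * p_gust, and a long chain loses about 1 / (1 - q) dominoes on average, so the run stays short unless q is very close to 1.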

In re your comment that we had a higher fault level in WWII, I would submit that non-automated systems and non-automated relationships like you saw in the late 30's and 40's, together with the general "can-do" or "make do, make it do, or just do without" mentality, increase fault tolerance tremendously. Simply put, the more manual the system or inter-relationship, the more fault tolerant it is, because a human is typically several orders of magnitude more fault tolerant than a (typically thought out) programmed decision tree. The human knows (or can creatively design on the fly) work-arounds; the decision tree does not (and cannot).

Thus, the fault tolerance of the "macro-system" in 1935 is incredibly greater than the fault tolerance in 1999.

Just my NTBH $.02!

Chuck

-- Chuck, night driver (rienzoo@en.com), February 04, 1999.


The model used to be comparable to a net - but with so much more global and national consolidation and standardization - particularly in electronics and after "lost" small machining and foundry shutdowns - now it is more like a hammock strung together with only four or five cross wires.

You can't "sit" in it - there is too much space between the lines. You can still lie on it - and won't fall through. But cut two cross strings, and you lose even the ability to lie on it. Like a hammock - the "strong" length members can't sustain a load if there are no cross threads holding them together.

A net - no, not at all. A net implies hundreds of redundant links and cross threads.

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), February 04, 1999.


I think that the hammock analogy is a good one: some crashes are more significant and have more repercussions than others.

While it's true that if one of your suppliers goes down you can maybe switch to another (depending on how specialised they are), if the power grid goes down - well - I think this one has been explored in some depth before, but you get my point.

There are some fundamental functions that must remain working for society to operate as it does now. I reckon that with power and comms things can at least be fixed eventually; without these 2 in particular we'll all end up worshipping fire and eating each other.

-- PowerTiger (powertiger@rocketmail.com), February 04, 1999.


Maybe the picture and comparison of a line of dominoes or a chain, with a net, will become clearer if we acknowledge that "net" is another name for a web. Surely the spider is "casting a net" for its dinner when she spins her web, and the basic support strands, if removed, can indeed bring down the entire construct.

As to Nigel's thought of applying a "pattern" to the graph paper, that is exactly what I see the Y2K software defect as doing!

-- Hardliner (searcher@internet.com), February 04, 1999.


So you're saying cutting power is like cutting the one rope that ties the hammock to the tree.....

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), February 04, 1999.


I think an analogy to the human body could be instructive. The human body is made up of zillions of highly complex, highly interdependent subsystems. These are similar to our current computer systems, in that they appear "programmed" for specific functions, they have a limited degree of adaptability and fault tolerance, and they seem to show evidence of intelligent design yet are fairly blind and stupid in their practical operations, not unlike software.

While taking out certain of these systems will cause instant failure of the organism, there's also a surprising amount of fault tolerance, redundancy, and fix-on-failure, or none of us would be walking around.

-RCat

-- Runway Cat (runway_cat@hotmail.com), February 04, 1999.


BUT he started to whine..........

CAN'T ANYBODY tell us what our 1999-2001 fault tolerance level is? I still submit that it is much lower than it was in 1935-1945, due to the fact that we are using pre-determined decision trees now and then people were making the majority of the decisions.

C

-- Chuck, night driver (rienzoo@en.com), February 04, 1999.


Runway cat - wish I'd thought of that!

Robert: if the energy grids fail 100% and "hard" (ie beyond our ability to fix up at least partially within a week for electricity, or somewhat longer for gas and oil), then yes: end of world. But this was precisely the sort of point I was trying to make about a net as opposed to a chain. The electricity grid is a net, not a chain.

-- Nigel Arnot (nra@maxwell.ph.kcl.ac.uk), February 04, 1999.


AND PS

Please don't start with the "We'll just go manual" mantra. The sheer number and speed of decisions made by machine resident decision trees is such that "going manual" has no true relevance. Can't get us humans to decide quite that fast.

C

-- Chuck, night driver (rienzoo@en.com), February 04, 1999.


But Chuck, don't you feel people are just overall a little smarter now? I mean, better educated, less likely to be stampeded arbitrarily, higher literacy levels, that kind of thing? And won't that make a big difference, at least in recovery speed?

-RC

-- runway cat (runway_cat@hotmail.com), February 04, 1999.



People are smarter now?!? For some time the segment of the population with the lowest mentality has been having a huge percentage of the babies. No, I don't think our collective IQ has risen in recent years.

-- Pearlie Sweetcake (storestuff@home.now), February 04, 1999.

Hey Nigel,

You may wish to thank Paul Davis for his analogy, which you cribbed.

just a thought.

-- Mutha Nachu (---@fire.com), February 04, 1999.


Going manual isn't an option. Take power for example - the systems have been computerised for so long that even if it were technically possible to return to running them manually, the people that know how to do so have long gone; this is the important point. With other industries, say banking, I suspect it would be impossible to stay open as usual manually due to the sheer quantity of processing required.

I think that society in general is less fault tolerant now than it would have been in WWII say, for the following reasons:

1) We no longer really live in communities, certainly not in cities (observe the failure of care in the community schemes in the UK). I really think that this makes a big difference as I suspect people will be less likely to pull together in adversity (I would love to be proved wrong).

2) We DEPEND on things like electricity and supermarkets. We didn't then. I remember my Dad talking about how as a child his parents kept pigs, chickens and had a vegetable garden; this was much much more usual than it is now. There are many more people in the world now as well - if the plug was pulled fifty years ago it would affect FEWER people who would all tend to be BETTER prepared anyway.

3) It's a global issue now, more than then. We have already witnessed the way economic problems in one country affect others (Japan, Russia, Brazil etc.) The point here is that rather than function as a lot of discrete systems with limited interfaces between each other we now behave much more like a huge, unpredictable, chaotic global system. We know that countries outside the US and possibly the UK are even further behind on y2k - to the point where effectively I suspect no action will be taken.

I think that this last point is the most important one; essentially what I am saying is that I think it's more of a chaotic system than a net or anything else. You simply cannot predict the effect of any given cause, which makes estimating the fault tolerance of society as a whole very difficult indeed.

-- PowerTiger (powertiger@rocketmail.com), February 04, 1999.


For the dominoes to start falling you have to reach critical mass, as in a nuclear reactor. The analogy is that the dominoes have to be spaced sufficiently close together.

-- dave (wootendave@hotmail.com), February 04, 1999.

just to complicate things a bit:

I would submit that the fault tolerance of a system can be substantially different from the fault tolerance of its subsystems. Going back to the 'what's the difference between WWII and now?' question, I'd point out that actual responses to problems varied by culture. One need only point out the varied nature of the responses of Italy and Japan to the possibility of invasion...

So, y2k is systemic - right? Taking RC's biological model for a minute, what are the chances that due to a combination of factors, including culture, a country or countries could "develop gangrene"? What happens if one or more of the current players goes down and stays down - no nasty weapons of mass destruction - just techno Babel...what then?

Arlin [who thinks all the Lenten music at choir practice may have affected his outlook a bit.]

-- Arlin H. Adams (ahadams@ix.netcom.com), February 04, 1999.



Nigel,

RC made a good point of systems being like the human body. It depends on where the problems are. You can have an accident by the failure of 5% of a car (the tires).

85% of business and government agencies in the U.S. will have their mission-critical systems compliant by 2000. Of all the sectors of the economy, the most compliant one is the banking and financial world. That *is* good news.

The bad news is that the public utilities sector is the LEAST prepared of all sectors.

Got H2O?

-- Kevin (mixesmusic@worldnet.att.net), February 05, 1999.


Consider too that even when a "single point" failure (like what was said about the one rope holding up a hammock) brings everything down "from what it used to be", the rope can still be repaired, spliced, or replaced; tied back up; then the hammock reused.

But the fall is still painful, and some people may break their backs. I agree that the general "character" of the people appears to be lower than before, and that many fewer people "know" how to manage without the infrastructure working absolutely correctly.

Compare, for example, NY or LA with a massive long failure, and Calcutta or Bombay - the Indians would in general survive much better after two or three weeks. There are just too many people in too small an area to manage without their services.

Unfortunately, many of those will consider "essential services" their handouts and TV, rather than a broom and mop to clean up.

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), February 05, 1999.


Chuck, PowerTiger

In many cases "go manual" is AT PRESENT not an option, because the organisation that tried to would be competed out of existence very fast. If everything is a big mess, this doesn't apply. As I've said before, in a shortage economy the reaction to a failed supplier is to try to help them out of trouble, and the reaction to demand exceeding supply is to raise prices (either suppressing demand or raising capital with which to expand or recover).

With respect to the power grid, it's harder to say. On one hand SCADA stands for *Supervisory* control and data acquisition; the underlying mechanisms work fine without computers. On the other is a horribly big unknown. Rick Cowles is predicting big trouble, not catastrophe, and has the sort of detailed knowledge needed to try to assess this. Dick Mills is more optimistic. Also, the fraction of embedded systems with Y2K problems is proving to be small, but the impact of a small percentage of failures on a control network is again unknown.

With respect to people adapting in a crisis: in WWII half the regular workforce got conscripted. The "home front" was "manned" to a great extent by housewives, who did a great job. In most cases they had no prior industrial or agricultural experience; in many cases, no experience of non-domestic work at all.

With respect to "single point" hammock failure: you can splice it up again if you can obtain thread, wire, sticky tape, or even a sheep(!) in time. The critical point is where so many things need splicing that the supplies needed to do so run out, and manufacture of new supplies isn't possible because the ropes are broken ... this is Infomagic's end-of-world scenario.

I personally don't think it'll get anywhere near that critical point, but I wish there was a way to get a proper handle on the problem instead of mere "feeling". Simple models aren't realistic, and realistic models aren't predictable -- in the mathematical sense, they are indeed chaotic.

-- Nigel Arnot (nra@maxwell.ph.kcl.ac.uk), February 05, 1999.

