fix on failure - the tree branch analogy

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

I'm a serious prepper (gold, grain, guns, you know the rest). I'm also a software professional who has written millions of LOC shipping in commercial products. I'll give you a "Devil's Advocate" analogy of why "fix on failure" IS possible:

Imagine a power utility operation. If you told them they must "fix" all branch/tree dangers across all their lines by a certain near date, and GUARANTEE that no branch/tree breakage would occur in the next predicted storm, think of the amount of money and effort it would take. And would such a guarantee, even if they chain-sawed every tree for miles, be completely reliable? Years of excruciating effort would be required to get even a leaky "guarantee". However, in actual storms, when tree damage occurs, there is in fact "fix on failure". The problem's location and nature are pinpointed, and the repair is accomplished.

Now, an analogy is not an identity. There are similarities and differences in every analogy. In this case, the 2 biggest weaknesses are:

1) the analogy (perversely) applies to all organizations OTHER THAN utilities. If the power goes down due to y2k, we indeed may be toast.

2) software problems are harder to locate than broken tree branches

Anyway, food for thought. Order another case of survival candles and another thousand rounds of .308.

-RC

-- Run W. Cat (runway_cat@hotmail.com), November 23, 1998

Answers

Sure, "fix on failure" is possible, even desirable, for some non-mission-critical applications (like screen savers, for example). But the question you've got to be asking yourself is this: Just what are "acceptable losses"?

Consider this non-life-threatening scenario: Joe's Widgits, Inc. uses fix on failure "methodology" and as a result, discovers on 01/04/00, that they can no longer produce widgits. After a week's worth of debugging (and calling customers to explain the delays), the staff at Joe's discovers that the Framitz subsystem (part of the numerically controlled master widgit bearing obfuscating assembler) has to be replaced. A quick call to the vendor reveals that although the vendor survived Y2K, it seems every widgit manufacturer and their brother now needs a new Framitz subsystem and they're now backordered 8 months.

While the widgit industry as a whole may survive, Joe's Widgits, Inc just became a very real casualty.

Yes, they identified the defective component in a fraction of the time it might have taken to find it before the failure. Tell that to the stockholders.

"Fix on failure" is simply a euphemism for "Piss Poor Planning".

-Arnie

-- Arnie Rimmer (arnie_rimmer@usa.net), November 23, 1998.


I agree that fixes may be possible at the very last minute as it were (after all, that's what the Russians plan on doing - Chernobyl, anyone...?).

If things start to break down in society however, *before* Y2k, say Nov/Dec. '99, will the programmers still be around to do the fixes? What are the incentives? They have families too. The same goes in the immediate days/weeks after Y2K. If there is a systemic breakdown, how do they get paid, how do they eat and stay warm if electricity is intermittent, how do they get to work through the panic-stricken mobs, how do they communicate if telecomms. lines are also down?

Last I heard 61% of programmers were planning on taking all their money out of the banking system during '99. This figure will no doubt increase. There will come a time when a percentage of these folks will realise that they cannot fix their own particular project in time and decide to bail.

I really think it's gonna be a mess - I would like to think that these folks will *stay* on the job, but unless one of the many martial laws to force these guys/gals to stay on the job (and this includes myself) is invoked then I can see mass desertion.

I want to stress that this is really a worst case scenario - it all depends on how things play out.

Programmers tend to think in flowcharts if you will:-)

One thing most of them are good at is working out things such as cause and effect/consequences/logic flows. If all the signs point to disaster they will bail in large numbers. Many are freelance, many have had umpteen employers over the years, there is usually not much loyalty to organisations.

It all depends on how things play out next year.

-- Andy (andy_rowland@msn.com), November 23, 1998.


You basically pointed out the big fallacy yourself -- indeed, as you said: "software problems are harder to locate than broken tree branches."

In the case of a fallen tree branch on a line, it's just a matter of driving to where a branch has fallen, and exclaiming, "Oh, there it is!". And then fixing it.

In the computer biz, the equivalent of this is to ask, when things go uh-oh, "What changed last?". Then you can usually fix it.

However, in the case of Y2K, you are dealing with All That Remediated Code, or (for companies that are really fix-on-failure "purists"!) you are dealing with Code That Has Not Changed In A Long Time. So, the answer to "What changed last" will tend to be either "Everything!" or "Nothing in a long time!".

This is why fix-on-failure is a myth. And why (among other reasons) Y2K cannot be fixed.

-- Jack (jsprat@eld.net), November 23, 1998.

I'm one step ahead of you here. I personally own 4 Framitz subsystems (part of the numerically controlled master widgit bearing obfuscating assembler) as well as 3 Burpo Speditzer COBOL unifiers and a batch of Left-Handed Fortran Cable Transformers.

If you don't know what it is, I got some.

-- Craig (craig@ccinet.ab.ca), November 23, 1998.


The troubleshooting needed for "fix on failure" will force unreasonable delays, particularly if there is partial or intermittent power, or unreliable data coming in. What is "the right answer" if you can't sort out the thousands of interfering "wrong input" datapoints? Which failure caused what effect? What failed - specifically - that must be fixed?

I agree - if you can find the problem - then you can start to find out if you can fix it. But can you afford that much time? Think about the widget above - six weeks later you get the part, install it, and it burns out or you find out that the next "press" or binder or folder or drill or molder or whatever now needs one too.

Fix on failure is really the final test, and it can succeed (it has to succeed) as the "get it working at all costs" step - but only after the regular debugging and remediation have removed the known problems, the testing has identified the 7% that were caused by remediation, and the re-testing has isolated the false input data and false output.

Craig - you forgot the left-handed calibrated nuclear-qualified Crescent hammer, the reverse-unscrewing metric vise-grip tweezers, and the frozen malleable oxy-dihydrogenated rubber monkey wrench.

And I can't believe you forgot the two most dangerous things of all: an engineer with a screwdriver. Or a lieutenant with a pair of pliers.

-- Robert A. Cook, P.E. (Kennesaw, GA) (cook.r@csaatl.com), November 23, 1998.



Four years ago I started preaching to management about spiraling complexity in our systems. They laughed at me, and still do, although Y2K has made them laugh a little less heartily. They can only see the linear FUNCTIONALITY increase (Lines of Code). System complexity is related to factorials (n!), a type of math that involves computing the combinations and permutations of arrangements of "things". The "things" in this instance are software states and interfaces. The factorials look like this:

1! = 1
2! = 1x2 = 2
3! = 1x2x3 = 6
4! = 1x2x3x4 = 24
5! = 1x2x3x4x5 = 120
...

So a system that has say five variables can have a typical number of states of around 120. Now say the system has 20 variables or interfaces.

20! = 2,432,902,008,176,640,000 (that's 2.4 billion BILLION)

Now, this is a simple and extreme example: the actual numbers--obtained using more complex formulas and depending on other factors--are lower. And modern software techniques allow us to manage complexity fairly well. But you don't have to be a Lead Programmer to understand what we're talking about here....
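For anyone who wants to check the factorial figures above, here is a minimal Python sketch. The function name `arrangements` is made up for illustration, and treating n! as "the number of states" is only the rough upper-bound picture described in this post, not a real complexity metric.

```python
import math

def arrangements(n):
    """Number of orderings of n distinct "things" (n!)."""
    return math.factorial(n)

# Print the same values quoted in the post, plus the 20-variable case.
for n in (1, 2, 3, 4, 5, 20):
    print(f"{n}! = {arrangements(n):,}")
```

Going from 5 "things" to 20 multiplies the line count of the system by maybe four, but multiplies the factorial by roughly 2 x 10^16, which is the gap between linear and factorial growth that the post is pointing at.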

Now, say that we have X number of programmers. When TSHTF, we still have around X programmers. Say Y2K will cause an initial tenfold rise in the number of glitches. How are the X programmers, who are used to dealing with X glitches, gonna handle this order of magnitude increase? And this is just *initially*. While the limited number of programmers are fixing those 10X glitches, the glitches will surely be compounded and result in some exponential increase such as 50 or 100X. And where does it end? It will be like a nuclear chain reaction.
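The backlog argument above can be sketched as a toy model. Everything in it is an assumption made up for illustration: the fixed daily repair capacity, the `spawn_rate` at which each unfixed glitch causes new ones, and the specific numbers. It is not a forecast, just the arithmetic of "fixed staff, compounding failures".

```python
def backlog_over_time(capacity, initial_glitches, spawn_rate, days):
    """Return the open-glitch count at the end of each day.

    capacity         -- glitches the whole team can fix per day (assumed constant)
    initial_glitches -- glitches open at the start (assumed)
    spawn_rate       -- new glitches caused per unfixed glitch per day (assumed)
    """
    backlog = initial_glitches
    history = []
    for _ in range(days):
        backlog = max(0, backlog - capacity)   # the team fixes what it can
        backlog += int(backlog * spawn_rate)   # leftovers trigger new failures
        history.append(backlog)
    return history

# Normal load: X programmers handling X glitches keep up.
normal = backlog_over_time(capacity=10, initial_glitches=10, spawn_rate=0.5, days=5)

# Tenfold load with the same staff: the backlog compounds instead of shrinking.
tenfold = backlog_over_time(capacity=10, initial_glitches=100, spawn_rate=0.5, days=5)

print(normal)
print(tenfold)
```

With these made-up numbers the normal case stays at zero, while the tenfold case grows every day. With a smaller `spawn_rate` or a larger `capacity` the same model drains the backlog instead, so the conclusion depends entirely on how fast failures really cascade.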

I have studied this thing to the best of my professional ability; my conclusion is we're fucked.

-- a (a@a.a), November 23, 1998.


Please watch the language!

-- Women and children present (red@ears.com), November 23, 1998.

He meant, we're shagged.

-- Andy (andy_rowland@msn.com), November 23, 1998.

No, I meant f*ucked...with a capital F.

-- a (a@a.a), November 23, 1998.

The falling tree branch analogy is ridiculous. When you have hundreds of thousands of branches falling all at once in every part of the world..... forget fix on failure.

Yes, the bad word with the capital "F" sounds about right.

-- Paul Milne (fedinfo@halifax.com), November 23, 1998.



Craig: Cool. I'll take two Left-Handed Fortran Cable Transformers, one Framitz subsystem (but only if it comes with the source code and a propane-powered subspace wallpaper hanger), three tricorders and one of those Burpo thingamajigs. And hey, do I get Green Stamps with those?

-- JDClark (yankeejdc@aol.com), November 23, 1998.

Yes, but the glue is on both sides....

-- Robert A. Cook, P.E. (Kennesaw, GA) (cook.r@csaatl.com), November 23, 1998.

"I have studied this thing to the best of my professional ability; my conclusion is we're fucked." a.

Very, very good sentence a. I agree with your conclusion, and you arrived at it from one of the stronger y2k=doom arguments, (there's lots of 'em, take your pick.) As for the precious one who cannot handle the ef word, how silly. If ever there was a topic that required forceful exclamation it would be the topic of just how FUCKED y2k is. It's life and death, not a bridge party.

-- humpty (cos"fucked"iswhatitis@wqerty.com), November 26, 1998.


Since it is obvious that many people object to the use of profanity in this forum, why don't you guys just respect this and refrain? Thank you.

-- Jack (jsprat@eld.net), November 26, 1998.
