feasibility of "fix-on-failure" contingency plans

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

As time grows ever so short, more and more I see references to contingency planning, with "fix on failure" being one touted method. At face value, this seems perfectly reasonable ... until one considers what we are dealing with: computer systems that simply give wrong answers; embedded systems that outright fail because of a single chip (possibly not even accessible, much less locatable), which itself might not be replaceable with a ready-made compliant version; perfectly working Y2K compliant systems (taking a leap of faith here, folks!) that have been corrupted with bad (non-Y2K compliant) data; etc. Comments on the feasibility of "fix-on-failure"? (And consider: this is the method most touted as being able to make Y2K problems short-term rather than long-term.)
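A minimal sketch of the "corrupted by non-compliant data" scenario mentioned above, assuming a hypothetical compliant system that naively widens two-digit years arriving from a legacy feed (the function, record layout, and values are invented for illustration, not taken from any poster's system):

# Hypothetical sketch: "bad data" from a non-compliant partner corrupting an
# otherwise Y2K-compliant system. Names and record layout are invented.
from datetime import date

def import_legacy_record(yy: int, mm: int, dd: int) -> date:
    """Careless widening of a two-digit year taken from a legacy feed."""
    return date(1900 + yy, mm, dd)   # assumes the century is always 1900

record = import_legacy_record(0, 1, 3)   # legacy feed sends year "00" in January 2000
print(record)                            # 1900-01-03 -- now stored in a "compliant" database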

-- Joe (shar@pei.com), September 09, 1998

Answers

Fix on failure is only reasonable for non-"mission critical" areas in standalone (i.e., isolated from other computers) systems. The smaller the overall suite of application programs, the better. In large, complex computer installations with layers of multi-system interfaces, it's a horrible idea. The trick is defining "mission critical" and having the resources available to fix a given problem.

Something that is seldom mentioned, but will become very important, is the availability of the same (identical) language compiler and link editor that was used to create the object code in the first place. I am aware of a significant problem in using "new" versions of IBM COBOL and PL/1 compilers/link editors. Many businesses are going to find out they actually don't have multiple versions of these things, and there will be new, non-Y2K problems introduced when they try to re-compile. These errors are, in a sense, independent of the code change. Many of today's Visual Basic programmers are ill-equipped to deal with these issues.

Sorry I've gotten so verbose, but the answer is "fix on failure" means lights out.

-- Ron H. (drherr@erols.com), September 09, 1998.

"Fix on failure" can be a euphamism for several situations:

1. "We don't believe this system will fail, so we will simply wait and see."

2. "We don't know wheter this system will fail or not, but we don't think a failure here will not hurt us too much, so we aren't going to bother until and unless something goes wrong."

3. "We think there may be some level of damage from failure of this system, but either we don't think it is a mission-critical system or we have other systems that are even more critical. Therefore, we are going to expend our resources on the really critical stuff and hope this system doesn't crap out too badly."

4. "We don't know what the hell is going on with Y2K, so we are just going to sit back, do nothing on this system and pick up the pieces when and if things go south."

If I had to rank them in probability, 3 is most likely to be the real meaning, followed by 2, then 4, then 1. But that's just my opinion and my guess. Your mileage may vary.

-- Paul Neuhardt (neuhardt@ultranet.com), September 09, 1998.


To a limited degree, "fix on failure" does "ease" the testing/finding/analysis phase, since "many" problems will have become self-evident. Fix on failure of course leaves thousands of "hidden" problems, plus the usual "errors created by fixing software": oops, I typed it wrong; oops, I tested it wrong; etc.

Also, IT departments will (finally) have the undivided attention and budget to fix Y2K problems once they are business-critical rather than theoretical. Also, some problems will not be discovered until "real world use" is applied by consumers. In this sense, everybody in every IT department will, to a limited degree (and for embedded control systems, to a great degree), be doing "fix on failure".

But I question (ridicule) those who would treat "fix on failure" as a primary diagnosis method. For example, if power fails and services become disrupted (to any degree), the ability to sit in a routine (or even emergency) working environment and correct things will certainly be "iffy".

Important related questions: And just how did people debug programs before Edison invented the light bulb? Did they work by candlelight? How did they get Cokes out of the vending machine? Who will run the coffee maker? How long can you run a laser printer off of a 12V car battery? A 9 Volt radio battery? If you only have one small emergency generator, can two programmers share their monitors? Can 1 person run a computer and another person use the monitor? Could they use two mice at the same time to save power? Does double clicking take twice as much power as single clicking? Do you save power by getting a mouse with three buttons, and programming the middle button to act as a "double click"? Does it take less energy to move the mouse "up and down" the screen, or left and right? In NZ or Australia, do they have to hold the mouse down against the mouse pad, and if so, does this increase friction losses?

-- Robert A. Cook, P.E. (cook.r@csaatl.com), September 09, 1998.


"Fix on failure" would in many cases result in making many fixes over a period of time, as various errors surface. Some errors are so subtle that it would take a while to find and fix, and the results would have propagated themselves into databases, etc. For example, unless remediation is done ahead of time, it is almost certain that any date arithmetic bracketing February 29, 2000, would result in error - but error of a kind hard to notice immediately. If you waited until you were aware of the error, but meanwhile you had mailed several hundred thousand incorrect statements to your customers, the costs to recover would be substantial, both in dollars and customer goodwill. These types of errors might be deemed "sins of commission", that is, the system did something, but did it wrong.

More subtle errors could be deemed "sins of omission". Consider some systems where meeting a deadline is critically important. The error might be that the system took no action when it really should have. If you found out (after the fact), by failure of your scheduling system, that you violated the terms of a contract for shipping materials or making a payment, or failed to file a tax report or legal claim within the defined time limits, the penalties could be severe. But since the system did nothing, there was no output to check for correctness of action, and it could take a long time before you discovered the mistake - past the time when you could take restorative action.
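A hypothetical sketch of such a "sin of omission" (task names and date layout are invented): a reminder job that computes lead time from two-digit years goes deeply negative once the deadline year wraps to 00, so the reminder branch never runs and there is nothing in the output to review.

# Hypothetical "sin of omission": a reminder job using two-digit years.
# A payment due in January 2000 (year stored as 00), checked in December 1999
# (year 99), computes a deeply negative lead time -- so nothing is ever printed.

def days_until(due, today):
    """Rough day count between two (yy, mm, dd) dates -- fine until the century wraps."""
    (dy, dm, dd), (ty, tm, td) = due, today
    return (dy - ty) * 365 + (dm - tm) * 30 + (dd - td)

def maybe_remind(task, due, today):
    lead = days_until(due, today)
    if 0 <= lead <= 30:                      # remind within 30 days of the deadline
        print(f"REMINDER: {task} due soon")  # ...which never happens once lead < 0

maybe_remind("contract payment", (99, 12, 20), (99, 12, 1))  # prints a reminder
maybe_remind("contract payment", (0, 1, 15), (99, 12, 20))   # silence: lead is about -36,000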

-- Dan Hunt (dhunt@hostscorp.com), September 09, 1998.


Fix on failure assumes the source code can be found.

-- Amy Leone (aleone@amp.com), September 10, 1998.


Additionally, fix-on-failure assumes that people will still have enough confidence to trust computers at all. I mean, consider the Y2K fallout environment where computers are no longer trustworthy. Continuing to apply "fixes", which themselves may result in further problems (that's why 50% or more of a supposedly well-controlled Y2K remediation project is testing!), will not be well supported by anyone.

-- Joe (shar@pei.com), September 10, 1998.

To be honest, fix on failure is the mode used now when production systems screw up. It is certainly a lot cheaper than remediation, in that you can wipe out that whole assessment phase, plus you don't need a separate test computer. The problem is that all the programs will be crashing at once, stressing existing resources. Additionally, if the fix you decide to make involves expanding date fields in the database, then all the old data will have to be converted. Normally with fix on failure you don't have to make database changes, and that conversion can be time-consuming. Changing code is relatively quick, if that is all you have to do. If a business doesn't mind being out of business for a month or more, fix on failure could work for them.
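A minimal sketch of the conversion step described above, assuming a hypothetical record layout and a 100-year pivot window; the point is that every existing row must be rewritten before the expanded-field code can go live, and that rewriting is the slow part, not the code change.

# Hypothetical sketch of the data conversion: widening two-digit years to four
# digits with a pivot window. Record layout and pivot value are invented.

PIVOT = 30  # assume 00-29 -> 2000-2029, 30-99 -> 1930-1999

def widen_year(yy: int) -> int:
    return 2000 + yy if yy < PIVOT else 1900 + yy

old_records = [("INV-1001", 97), ("INV-1002", 0), ("INV-1003", 2)]  # (id, yy)

new_records = [(rec_id, widen_year(yy)) for rec_id, yy in old_records]
print(new_records)  # [('INV-1001', 1997), ('INV-1002', 2000), ('INV-1003', 2002)]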

-- Amy Leone (aleone@amp.com), September 10, 1998.

And (continuing from Amy's post...), if people don't mind not getting paid for a month, not having electricity for a month, starving for a month ........

-- Joe (shar@pei.com), September 10, 1998.

In the long run, they'll all be warm and have food. It's their bad luck if they have a tendency to get cold and hungry in the short run. .....

-- Dan Hunt (dhunt@hostscorp.com), September 11, 1998.
