stupid question?

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

Now if this is a stupid question, chalk it up to my being a carpenter and not a computer expert. If some are waiting to fix on failure, then when there is a failure, is there an indicator to show where in the code the problem lies? If not, it seems to me that it would take more time to fix in a crisis situation than now. Maybe someone can explain to me why my thinking is in error.

I appreciate the experience and expertise shared on this site. I started taking Y2K seriously when Windows on my PC crashed repeatedly, finally taking out my hard drive. It took me a day and a half to install a new one and install the operating system from scratch. (I'm a carpenter, remember.) I felt like I was playing "Mother May I" with a sadist. It made me realize how the slightest thing out of order will make the whole process not work.

Later

-- Usually a lurker (curious@yellow.com), January 23, 1999

Answers

I'm not a computer expert either (I'm a graphic artist). From what I understand, fix on failure isn't much of an option. My take is that when someone says they are planning to fix on failure, it just means that they aren't really planning at all. I don't think a person needs to be a computer expert to see some major flaws in the fix on failure "plan". Your thinking isn't in error, theirs is.

-- d (d@dgi.com), January 23, 1999.

If it's a company with customers that says they're going to do the "Fix on Failure" trick, then they've committed themselves and their customers to waiting until the system is fixed to get back to business. But businesses don't operate in a vacuum. How long can everyone afford to wait?

If your local lumberyard can't get materials for you, how long can you stay in business? How long after its stock on hand is sold out can it stay in business if the computer's down and it can't order more stock?

What if the lumberyard gets their computer fixed and then finds out that their suppliers are doing THEIR fix and it'll take six months? The lumberyard, you, your customers and every other local carpenter and their customers are stuck waiting for how long?

Now suppose it's the power company, the railroad or the trucking company that everyone has to wait on after the lumberyard's suppliers get their computers fixed? It's a grim picture of a never-ending daisy chain.

Fix on failure only works where the end user is the only one affected by the decision. A home computer is about the only example I can find where it would make sense to play fix on failure. Because in a really bad Y2K crisis, people aren't going to have the time and money to spare to get their computer back up and running. Especially if they're worried about keeping their house, their job and staying warm and fed.

WW

-- Wildweasel (vtmldm@epix.net), January 24, 1999.


WW, I have a question about Fix On Failure for home computers. I think many people are planning that. Apple is compliant and the iMac is a very fast, easy computer to use. But ... Apple just announced with pride that they had mastered Just In Time manufacturing and had only 2 days of inventory on hand. 2 days. Do you think the computer retailers will have enough compliant stock on hand to sell every Fix On Failure household a new working computer?


-- Leska (allaha@earthlink.net), January 24, 1999.


The difference is like your car vs. a jet airplane. Most people get their car repairs done in a fix on failure mode. A hose breaks, the alternator goes out, you get a flat tire.... get your car towed and fix the problem. If a 747 with 400 people on board has a problem, it is much harder to fix on failure. Engine falls off, landing gear stuck ... This is why the airlines try to fix things before they break. Big computer systems must be fixed before they crash, just like 747s.

-- Bill (y2khippo@yahoo.com), January 24, 1999.

Usually,

I like your description of the hard drive installation - btdt, didn't like it either!

There's another element to fix on failure that's present with oil companies like Chevron and Phillips. Many of their embedded systems are not readily accessible without shutting down at least part of their refineries... My bet is that they're going to stockpile as much as they can prior to 01/01/00 on the theory that all of their competition is going to be running into the same problems they are, and it will simply be a race to see who can get their facilities back in action first.

Just my 2 cents' worth, Arlin

-- Arlin H. Adams (ahadams@ix.netcom.com), January 24, 1999.



First, let's be honest. Fix-on-failure is what every organization is going to be doing, no matter how compliant they *think* they are in advance. Code maintenance has an element of fix-on-failure that's unavoidable, since the old saw is true that there is no useful, nontrivial piece of software in use with no bugs in it.

It really is true that working backwards is easier than working forwards. What I mean is that once the code actually breaks, you can see the break. Finding the root cause of the break is often time consuming, since the overt symptom usually isn't where the problem lies; it's the *result* of the problem. Often, this is a fairly long chain, where the error happened much earlier, and gradually poisoned the output of subsequently executed modules until it finally came to a head somewhere, with symptoms utterly unrelated to the actual bug. It takes many years of experience to develop a 'nose' for bugs, and home in quickly on the actual source with relatively little data.

Looking at source code and trying to figure out if something is a bug, and guessing what it might cause down the road is often impossible. Also, looking at whole big chunks of code in a large system and trying to determine if those chunks can *ever* get executed under any circumstances can be tough. I'd be amazed if we aren't busy remediating an awful lot of 'dead' code without anyone knowing it or being able to know it.

Y2k is likely to cause some unusual errors, resulting in error handling routines being called that have *never* been executed before. (As an aside, error handlers are tough to test, because many errors are tough to generate as a test case. My experience is that well under 50% of error handlers work right the first time, and there are a lot of them out there that have never been run even the first time. They soon will be.) Errors in error handlers cause faults that stop computers in their tracks. Expect a lot of this.
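To make the point about rarely run error handlers concrete, here is a minimal Python sketch. It is not from any real system: `parse_yymmdd` and `load_record` are hypothetical names, and the "assume the 1900s" rule is an assumed legacy convention. It shows one way to force execution into a handler by feeding it an input that only becomes invalid under the old rule.

```python
from datetime import date


def parse_yymmdd(text: str) -> date:
    """Parse a YYMMDD string, naively assuming the 1900s (hypothetical legacy rule)."""
    yy, mm, dd = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return date(1900 + yy, mm, dd)


def load_record(text: str, parser=parse_yymmdd):
    """Return (parsed_date, error_message); the except branch is the rarely run handler."""
    try:
        return parser(text), None
    except ValueError as exc:
        # Error handler: report the bad input instead of crashing the batch.
        return None, f"bad date {text!r}: {exc}"


# 1900 was not a leap year, so Feb 29 of year "00" raises ValueError
# under the legacy rule and finally drives execution into the handler.
parsed, error = load_record("000229")

# An ordinary 1999 date never touches the handler at all.
ok_date, ok_error = load_record("990123")
```

A test like this is cheap to write in advance; the alternative is discovering in production whether the handler itself works.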

About all that can be said for fix-on-failure is that it serves as a filter or prioritizer for the bugs, separating the killers from the cosmetics, and the serial killers from the one-time killers. The relative severity of a y2k bug is not often obvious from reading the source. In most cases it's obvious pretty damn quick in production runs.

If finding and fixing everything you can in advance is (say) a 3-year process, then finding and fixing on failure might be a 1-2 month process. Crash. Fix. Crash again. Fix. etc. After 1-2 months, the computer is running more often than it's down, and hopefully doing most things right. In six months, you're nearly back to your original efficiency, and your code has become *really, really ugly*. Your documentation is worthless. You've generated a zillion versions, and nobody knows which version of which module is actually running.

I repeat, in real life some of this is going to be unavoidable. The more you've done in advance (and the better you did it), the shorter the period of intense pain.

-- Flint (flintc@mindspring.com), January 24, 1999.


I am a programmer, albeit for Win platforms only. Fix-on-failure is THE worst way to deal with a software/firmware problem. Failures in a program are not always apparent (financial calculations showing up as whacked only at the end of the quarter, etc.) and a failure in subroutine A of program B could cause a problem only in subroutine C of program D, not where the actual failure is. (Some folks in the biz call it "error ventriloquism" or "throwing" an error.)
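A small Python sketch of that "thrown" error, under stated assumptions: both functions and the two-character year field are hypothetical, invented only to illustrate the idea. The bug lives in the first routine, but the only visible failure is in the second.

```python
def next_renewal_year(two_digit_year: int) -> int:
    """Routine A: the real bug. 99 + 1 gives 100, with no rollover handling."""
    return two_digit_year + 1


def format_invoice_line(customer: str, renewal_year: int) -> str:
    """Routine C: insists on a two-character year field, so the failure shows up here."""
    field = str(renewal_year)
    if len(field) > 2:
        raise ValueError(f"year field overflow: {field}")
    return f"{customer:<10}{field:>2}"


renewal = next_renewal_year(99)  # silently wrong, no error raised here
try:
    line = format_invoice_line("ACME", renewal)
    failure = None
except ValueError as exc:
    failure = str(exc)  # the symptom appears in the formatter, not in routine A
```

Whoever gets paged sees a formatting error and starts reading the formatter, which is working exactly as designed.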

If you're doing business with any entity that has committed itself to dealing with Y2K in the fix-on-error manner and you depend on this business, I'd suggest finding a competitor with their act more together really really fast.

OddOne, who's glad he's not a part (read: cause) of the problem... His code is being NTSL-certified and has always been Y2K-understanding-and-well-behaved...

-- OddOne (mocklamer@geocities.com), January 24, 1999.


I think Flint gave a good synopsis. Fix on failure is feasible if there is good error handling, and the failure produces an error anticipated by the developer. But since event anticipation is how we got into this mess, ....

At least as big a problem, as OddOne describes, is that fix on failure only applies if something actually fails and generates an error. I do a lot of MS Access database development/remediation, and the biggest problem is that Y2K errors generally don't crash the app. It just ticks along, storing wrong dates and permanently skewing the data.
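The same effect can be sketched outside Access. The Python below is only an illustration, with hypothetical function names and an assumed pivot of 50 for the windowing fix. It shows two-digit year arithmetic producing a wrong value without raising anything.

```python
def years_of_service(hire_yy: int, current_yy: int) -> int:
    """Two-digit subtraction, as if the records stored only YY."""
    return current_yy - hire_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing: YY below the pivot is read as 20YY, otherwise 19YY (pivot is an assumption)."""
    return 2000 + yy if yy < pivot else 1900 + yy


def years_of_service_windowed(hire_yy: int, current_yy: int) -> int:
    """Same calculation after expanding both years to four digits."""
    return expand_year(current_yy) - expand_year(hire_yy)


# Hired in '85, evaluated in '00: nothing crashes, the number is just wrong.
skewed = years_of_service(85, 0)
corrected = years_of_service_windowed(85, 0)
```

Because no exception is raised, the skewed value gets stored and reused, which is why there is nothing for a fix-on-failure approach to react to.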

And U. Lurker, your HD experience is a pretty good example of the pickle we are in. In a computer, as in life, Everything Is Connected. Life just has better documentation.

-- Lewis (aslanshow@yahoo.com), January 24, 1999.


Many of the Y2K problems will not be of the nature of a crash or error to be handled by an error routine - they were not considered to be in the realm of possibility. Rather the results will be consequences of taking the "other" path in the logic. This will result in the next job or the user discovering more, less, or the wrong input or resulting output.
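A minimal Python sketch of taking the "other" path, assuming a hypothetical order-processing job that compares two-digit expiry years. No error routine is ever involved; the next job simply receives fewer orders than it should.

```python
def process_orders(orders, today_yy):
    """Split orders into accepted and rejected by comparing two-digit years."""
    accepted, rejected = [], []
    for name, expiry_yy in orders:
        # The "other" path: expiry year 01 compares as less than 99,
        # so a card that is still valid is treated as long expired.
        if expiry_yy < today_yy:
            rejected.append(name)
        else:
            accepted.append(name)
    return accepted, rejected


# Hypothetical data: alice's card expires in '99, bob's in '01; today is '99.
orders = [("alice", 99), ("bob", 1)]
accepted, rejected = process_orders(orders, today_yy=99)
```

Both branches are legitimate code that has run many times before, so nothing in the job log distinguishes this run from a normal one.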

-- curtis schalek (schale1@ibm.net), January 24, 1999.

Let's look at the "fix on failure" mantra a slightly different way.

Pretend there were only 1000 instances of a date being read in a business's family of programs - the lumberyard or hardware store you're getting your material from. Assume, for now, that power is maintained (utilities like power companies absolutely must avoid fix on failure). Even if only 150 date read-and-interpret commands are business-critical, it means that all 150 must be found and fixed before you can reliably do business with them. How long will that take? As others have mentioned, how can they all be found? How long can they remain in business with make-do accounting and inventory and payroll solutions based on phone calls and promises from customers and suppliers?

Now, perhaps you can still get inventory manually from the owner for a couple of days - because he knows you, and is willing to take your "promise to pay" or check or even cash - but how will he remain in business longer without the ability to re-order, to pay taxes, to calculate and pay his people and his creditors?

Assume that many things in programming and remediation are minor or non-critical. The very fact that 85% of programs may fail without causing a business failure means only that the other 15% must work absolutely correctly for the business to remain operational. But in the midst of manual and emergency operations, how are you rationally going to isolate, track down, fix, and retest the 15% of the failures that will put the company out of business? They are hidden in the garbage and distractions of the remaining 85% - if indeed anybody can really claim that they've "isolated" the correct 15% in the first place. Fix on failure, when importing data from other companies, will tend to corrupt both databases.

On the other hand, if the business has tried to get its programs fixed and tested, it now has only a few things that actually fail in use, has isolated the trivial programs (or fixed them as well) from the vital, and is able to track down the relatively few errors that will creep through testing unobserved. For a company that has tested its solutions, fix on failure will essentially be the last step to complete repair.

Replacing the sparkplugs on one car that won't run is easy compared to recalling 100,000 cars to check whether any spark plugs are of model ABC, lot number 123456, and replacing them. Y2K remediation is like the auto recall - everything has to be recalled and checked - which is a massive job.

But "Fix on Failure" is akin to calling the owner of your auto dealership at midnight on Christmas Eve and telling him "My car won't run, it's on the highway near downtown, get a wrecker, pick it up, and fix it. I'll get it from you at 8:30 tomorrow because I have to take the kids to grandmother's house." Then hang up.

You're not giving the mechanic who has to do the work enough info, enough time, or the car for long enough to do the same job.

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), January 24, 1999.


