Operator Error


No one really talks about the impact of operator errors. What happened at Peach Bottom was the result of a 'tester' error.

Isn't that the same as an operator error?

March 28th is the 20-year anniversary of TMI. Below is taken from a newspaper article in the Harrisburg Patriot-News; the paper is located within walking distance of the nuclear reactor.

"The Unit 2 reactor suffered a partial meltdown 20 years ago when a pressure valve jammed open, causing the system to lose water used to keep it from overheating. Human error prevented the control-room operators from identifying the problem for more than two hours."

This was when the rest of the industry was in the "best of times." This was with all 'safety' gauges working.

What will happen at the operator level [despite all good intentions] during the "worst of times"?

During the late '60s I worked [in management] for a large service bureau that used IBM 360s. Many a night we were called in because of a crash that required applications knowledge. Why? Because the programmers could not locate the source of the problem.

This occurred mainly when we were doing major program upgrades or adding new applications to the existing system.

The systems analyst had graduated from Princeton and had his MA... he was brilliant and ahead of most in his field.

Anyone who tells you we are not going to experience major problems within the electrical or any other industry is seriously delusional.

Too much code has been reworked, too much interfacing between the remediated systems is necessary... too much room for error.

-- Anonymous, March 10, 1999

Answers

I completely disagree with calling the Peach Bottom incident the result of human error -- it should rightly be classified as a computer error. Let's look at what happened, as described in Dick Mills' "A Real Life Nuclear Safety Related Y2K Incident" of 3/5/1999:

    The tester entered a post-2000 date. Nothing at all appeared to happen. He assumed he entered the date wrong, so he did it a second time. This time the process computer functions failed, and as a consequence, the SPDS (Safety Parameter Display System) went blank. It took several hours, and reportedly several attempts, to restore things to normal.

    The error was that when the tester entered the first date into the backup computer, it halted. Because the primary computer took over, it appeared nothing happened. The second command did not repeat to the backup computer; it went to the primary computer, which also halted.

(emphasis mine)

Any reasonable person should conclude that the tester should have, in fact, received some kind of indication that the secondary (test) system failed, and that the primary system had engaged. (You know, something like: "WARNING: SYSTEM FAILURE, BACKUP SYSTEM NOW ENGAGED".) Especially since the fallback system appears to engage seamlessly. Especially when the system is responsible for important stuff like monitoring nuclear reactors!
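
To make the design flaw concrete, here is a minimal sketch in Python (all names are hypothetical -- this illustrates the failure mode Mills describes, not the actual plant software) of a command path that swallows a halt silently, next to a variant that announces it:

    class Monitor:
        """One computer in a redundant monitoring pair (illustrative only)."""

        def __init__(self, name):
            self.name = name
            self.halted = False

        def set_clock(self, year):
            # Stand-in for the reported Y2K fault: a post-1999 date
            # halts the machine instead of being rejected cleanly.
            if year > 1999:
                self.halted = True
                raise RuntimeError(f"{self.name} halted")


    def active_machine(backup, primary):
        # The test console aims entries at the backup; once the backup
        # is down, entries silently fall through to the primary.
        return backup if not backup.halted else primary


    def enter_date_silently(backup, primary, year):
        # The flawed design: the halt is swallowed, so to the tester
        # it looks as if nothing at all happened.
        try:
            active_machine(backup, primary).set_clock(year)
        except RuntimeError:
            pass  # no indication of the failure, none of the failover


    def enter_date_with_warning(backup, primary, year):
        # The fix being argued for here: surface the failover loudly.
        target = active_machine(backup, primary)
        try:
            target.set_clock(year)
        except RuntimeError:
            other = primary if target is backup else backup
            status = "NOW ENGAGED" if not other.halted else "ALSO DOWN"
            print(f"WARNING: {target.name.upper()} FAILURE, "
                  f"{other.name.upper()} SYSTEM {status}")


    if __name__ == "__main__":
        backup, primary = Monitor("backup"), Monitor("primary")
        enter_date_silently(backup, primary, 2000)  # backup halts; screens unchanged
        enter_date_silently(backup, primary, 2000)  # primary halts; SPDS goes blank
        print(backup.halted, primary.halted)        # True True -- and never a warning

Run with enter_date_with_warning instead, and the very first entry prints "WARNING: BACKUP FAILURE, PRIMARY SYSTEM NOW ENGAGED" -- the tester never sends the second, fatal command.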

The fact that this is simply the way the system works, regardless of whether the failure per se is induced by Y2K, is irrelevant -- it is still clearly a machine error, not a human one. Further, Y2K is going to "push" everything -- humans and machines -- to their limits. Poorly designed systems such as the one at Peach Bottom are going to "feel the heat".

-- Anonymous, March 10, 1999

Any reasonable person should conclude that the tester should have, in fact, received some kind of indication that the secondary (test) system failed, and that the primary system had engaged. (You know, something like: "WARNING: SYSTEM FAILURE, BACKUP SYSTEM NOW ENGAGED".) Especially since the fallback system appears to engage seamlessly. Especially when the system is responsible for important stuff like monitoring nuclear reactors!

So you are saying that the 'programmer' who set up the test parameters should have programmed a warning for when the first system failed... that is still human error, not computer error.

The point is that mistakes are going to be made; whether it be simply an oversight or an outright error, the repercussions will still be the same.

Programmers are working long hours, under a lot of pressure, in the face of an immovable deadline, and mistakes will be made.

The smallest of errors can take us down.

-- Anonymous, March 10, 1999


While I understand Jack's point that it's possible the Peach Bottom technician did not make an error based on what he knew at the time, I have to agree with roa. There are no computer errors, only human errors. There are computer breakdowns, such as hardware failures, but otherwise a computer does exactly what it was programmed to do by one or more human beings. First law of programming: GIGO -- Garbage In, Garbage Out. Programming design deficiencies are also human-based.
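
The Y2K bug itself is the textbook case of "the computer just did what it was told." A minimal sketch (hypothetical code, but representative of the two-digit-year arithmetic now being remediated):

    # Dates stored as two digits, the way countless 1960s-era systems
    # did to save memory. The machine executes this faithfully; the
    # "error" was a human design decision made decades earlier.
    def years_elapsed(start_yy, end_yy):
        return end_yy - start_yy

    print(years_elapsed(65, 99))  # 34  -- correct for three decades
    print(years_elapsed(65, 0))   # -65 -- on 2000-01-01, the "computer error"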

One consistency I can count on when I'm out with my husband and we hear someone say, "The stupid computer made a mistake," is for him to reply, "No, the computer just did what it was told. A person made the mistake."

-- Anonymous, March 10, 1999


Well, OK, taking what you guys are saying to the limiting case: All computers were built by humans, therefore every "computer" error is really a human error.

Which, in a sense, if that's the way you want to look at it, still supports my basic quibble: dismissing the Peach Bottom incident as simply a "human" error, and thus not really pertinent to the Y2K problem, is incorrect.

(And I still say that to knowingly write software that allows a system failure without an indication is more than just a goof, it's downright deadly. And if, as roa suggested, the operator perhaps knowingly suppressed such an indication, that ought to be considered downright criminal. These people are not flipping burgers, for crying out loud, they are entrusted with monitoring nuclear reactors!!!)

-- Anonymous, March 10, 1999

I think you're missing the point - the "label" you put on the test result is immaterial; the critical fact is that the test was done at all.

The plant progressed far enough in remediation, piece testing, program testing, system testing, and then integrated testing to allow unexpected problems to be identified - under controlled conditions, and nine months before the real case could happen.

I'm not happy with the results - I'd rather that nothing ever went wrong, but I'd rather see this result than a series of "canned" demos that don't expose problems. Now the symptom (loss of display and values) has been found, the secondary problem (an unexpected interface between the primary and secondary monitoring systems) has been identified, and both problems have been tracked down and corrected - good.

Now, do it again. Repeat this and other tests. Find the next interface.
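
In testing terms, "do it again" looks something like the sketch below: a table of the known trouble dates run against every remediated interface. (The interface function here is a hypothetical stand-in; the dates are the ones Y2K test plans commonly exercised.)

    import unittest
    from datetime import date

    ROLLOVER_DATES = [
        date(1999, 9, 9),    # 9/9/99, sometimes used as a sentinel value
        date(1999, 12, 31),  # last day before the rollover
        date(2000, 1, 1),    # the rollover itself
        date(2000, 2, 29),   # 2000 IS a leap year (divisible by 400)
        date(2001, 1, 1),    # first ordinary post-rollover date
    ]

    def interface_under_test(d):
        """Hypothetical stand-in for one remediated interface."""
        return d.strftime("%Y-%m-%d")

    class RolloverTest(unittest.TestCase):
        def test_every_rollover_date(self):
            for d in ROLLOVER_DATES:
                with self.subTest(date=d):
                    self.assertEqual(interface_under_test(d), d.isoformat())

    if __name__ == "__main__":
        unittest.main()

The point of keeping the dates in a table is exactly the one being made here: when the next unexpected interface turns up, you add it and run the whole battery again.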

It's critical that the nuclear operators and design teams are doing this - because nobody else is.

-- Anonymous, March 10, 1999



This is a point that Robert is flogging on every forum he can find, as well he should. We want everything in the world tested and audited. If it were, we would be encountering thousands of examples like this one. Maybe tens of thousands. More's the pity for the world that we aren't; it's inductive evidence of how little serious testing is taking place.

So, hooray for this failure.

Whether the nuclear industry will make it is something I can't estimate from my professional background. But Robert is spot on that any industry that has long been forced to audit its work or face major political and financial penalties is destined to be in better shape than those that have not.

To wit, the rest of the utility industry.

-- Anonymous, March 10, 1999

