What Ever Happened to Implementation?


Back in the early days (like six months ago) it was assumed that utilities would go through four phases: Inventory, Assessment, Remediation, and Implementation.

Remediation is now called Remediation/Testing. Testing is done at the unit and subsystem level.

Implementation means getting the fixed/tested equipment installed and operating in the system.

Implementation isn't even the last step. At some point you have to test everything working together and then everything working with the supposedly compliant telephones and other supplier equipment.

Six months ago, talk about system testing all but ended, save for a few utilities that are actually doing system tests.

The NERC report seems to have dropped emphasis on implementation. We now have the term remediation/testing instead of remediation.

Perhaps I have missed the boat.

-- Anonymous, January 23, 1999

Answers

I started working on Y2k programs about 4 years ago, and have worked on many Y2k programs since then. While the semantics may have changed over the succeeding few years, the distinct process steps remain the same:
  1. Awareness
  2. Inventory
  3. Assessment (Inventory and Assessment phases have also been lumped together now, but in reality, these are two very separate stages...)
  4. Testing
  5. Remediation
  6. Acceptance Testing (production and regression testing, when applicable)
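
For anyone who hasn't lived inside one of these projects, here is a minimal, purely hypothetical sketch (in C, not drawn from any utility's actual code) of the kind of defect that steps 4 through 6 exist to find, fix, and then re-verify: two-digit year arithmetic that breaks at the century rollover, repaired with the common "windowing" technique. The function names and the pivot value are invented for illustration only.

    #include <stdio.h>

    /* Broken: two-digit year arithmetic goes negative at the rollover. */
    int record_age_broken(int yy_now, int yy_record)
    {
        return yy_now - yy_record;            /* 00 - 99 = -99, not 1 */
    }

    /* Windowed remediation: two-digit years below the pivot are read as
     * 20xx, the rest as 19xx. The pivot (50) is an assumption; real
     * projects chose it per application. */
    #define PIVOT 50

    static int widen(int yy)
    {
        return (yy < PIVOT) ? 2000 + yy : 1900 + yy;
    }

    int record_age_fixed(int yy_now, int yy_record)
    {
        return widen(yy_now) - widen(yy_record);   /* 2000 - 1999 = 1 */
    }

    /* A tiny "acceptance test": exercise the rollover boundary explicitly. */
    int main(void)
    {
        printf("broken: %d\n", record_age_broken(0, 99));  /* prints -99 */
        printf("fixed:  %d\n", record_age_fixed(0, 99));   /* prints 1   */
        return 0;
    }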

While these process steps can be worked (to some degree) in parallel, you have to have a very organized project to do so. In most electric utilities that I've worked with, this isn't the case - there are too many interdepartmental turf barriers because of the geographic dispersal of the individual departments and plants. So, for reporting purposes, the project plans take on a look similar to this:

  1. Awareness/Inventory/Collection of vendor statements (no lines of demarcation between the individual phases)
  2. Risk Assessment (legal, not operational)
  3. Remediation/testing

What's wrong with this picture?

-- Anonymous, January 23, 1999


It is difficult to adequately define testing in terms of IT/software projects. The problem is that in many cases, early tests illuminate large shortcomings in the remediation phase. If the problem is large enough, testing stops until the newly found problem can be fixed. This cycle continues, getting closer and closer to the target, until the number of failures is acceptable. The system is then put into production and the failures that arise are fixed in real time. A high degree of internal IT team communication can shorten this process (a skunkworks approach), but complex systems are just that. Mythical Man-Month tells us that Accidental Complexity can be reduced, but Domain Complexity is fixed and unchangeable. In non-geek speak ... we are trying to make the sand fall through the hourglass more rapidly.
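
In schematic terms, the cycle described above looks something like the loop below. This is only a sketch with invented names and an assumed failure threshold, not anyone's actual test harness; the point is that each pass finds fewer failures, release happens when the count is merely acceptable, and whatever remains is handled fix-on-failure in production.

    #include <stdio.h>

    #define ACCEPTABLE_FAILURES 5   /* assumed threshold, set per project */

    /* Stand-ins for a real test suite and a real remediation effort. */
    int run_tests(int cycle)   { return 40 / (cycle + 1); }  /* failures found */
    void remediate(int found)  { printf("fixing %d defects\n", found); }

    int main(void)
    {
        int cycle = 0;
        int failures = run_tests(cycle);

        /* Test, fix, retest -- each pass converges on the target. */
        while (failures > ACCEPTABLE_FAILURES) {
            remediate(failures);
            cycle++;
            failures = run_tests(cycle);
        }

        printf("released after %d cycles with %d known failures;\n",
               cycle, failures);
        printf("everything else becomes fix-on-failure in production\n");
        return 0;
    }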

This is why *real* testing of large systems takes a year or more. I was project lead on a small project that included a team of about 20 people. On that project, not only did testing point out shortcomings in the implementation, but it also pointed out holes in the *design*. If the testing team is finding design flaws, the project is in serious trouble. That particular 9-month project lasted for 3 years!

If serious system testing with live data isn't already underway, then our normal testing phase will not happen. This means that as of now, we are in a fix-on-failure mode for most of the projects. Fix-on-failure development is always inferior to properly planned development. This will have a cascading effect that ultimately will produce more errors to fix-on-failure.

Bottom line: to me, the whole terminology game here is nothing but mud soup. Sure, I know we need the distinctions to help organize the attack on the problem. However, please realize that the whole premise on which these numbers are based is VERY subjective. There is simply no measuring stick for the completeness of IT projects, due to Domain Complexity.

My favorite truism distilled from 13 years of IT experience ... "How long will it take? Ask me when we are finished!"

-- Anonymous, January 24, 1999


The uninitiated should realize that there is a very high signal-to-noise ratio on this little thread.

In the best of times, there is somewhat of a disconnect between mgmt and IT (and responsibility for that cuts both ways). Rick's comment illustrates something unique to Y2K efforts (cf Risk Assessment): not only the legal but the political dynamic (taking "politics" both narrowly and broadly) often replaces the most obvious technical requirements for succeeding in this critical area.

We won't know until next year, unfortunately, whether politics swamped remediation, but you can bet that the always-slow pace of IT (see David Smith above) has been made still more glacial by the politics. Carmichal has proposed "viscosity" as a metaphoric measure of the Y2K consequences we may experience in 2000 (as JIT "slows down", etc.), but organizations are already experiencing it internally.

The other salient issue here is, as Tomczak points out and Cowles confirms to those experienced in IT, the absence of calendar time left for adequate production testing.

That is, while various types of testing take place both serially and in parallel, it's simply too late for acceptance testing, given NERC's own reported compliance statistics.

Yet, no responsible IT executive would *ever* (let me repeat that, *ever*) release an important system without such testing.

Obviously, the fixed schedule for Y2K dictates that "whatever is 'done' will be released," but it doesn't take a rocket scientist or an IT professional to grasp the likely consequences. Thousands of systems will be released *simultaneously* without acceptance testing.

David Smith references Fred Brooks above (Mythical Man-Month). My father-in-law worked with Fred on the first, infamous 360 project. It was a debacle, yet acceptance testing was performed (of course)! Nothing is more telling than the simple fact that we are repeating crucial errors known to be disastrous more than 30 years ago at the very time we can most ill afford to repeat IT mistakes.

It may well be that the lights stay on. I am not a utility expert. But it is incumbent upon the industry, and especially utility IT professionals and consultants, to raise a public alarm, and not only a private one, about the probable consequences of skipping acceptance tests.

Otherwise, "compliance percentages" are thrown into the air by industry managers and picked up for reporting by the press to an ignorant public without any connection to reality at all. If politics overwhelms the need to warn the public, all in the name of preventing panic, the risks to communities drastically outweigh the presumed short-term public relations benefits.

-- Anonymous, January 24, 1999


As an addendum, implementation (with code frozen in place) should occur before the 1/1/2000 rollover. Big shops should freeze production code by 10/31/99. Of course, I know that isn't going to happen. There are going to be a lot of systems placed in production the last week of Dec '99. History says they will promptly crash the first week of Jan 2000.

-- Anonymous, January 24, 1999

Cannus Maximus has reduced the complexity of y2k down to 10 words:

Thousands of systems will be released *simultaneously* without acceptance testing.

Ten more words are in order:

Let not this become the epitaph of a stillborn century.

-- Anonymous, January 24, 1999



...and let me add one last thought/translation to Mr. Herr's and Mr. Tomczak's thoughts:

Because a lot of code and control systems will be rushed back into production without adequate acceptance testing, there will be a lot of them breaking post-01/01/2000 because of lousy testing, not because of true, undiscovered Y2k issues.

And I'll go out on a limb and make a prediction: many failures of systems post 01/01/2000 will be blamed on everything *but* Y2k-related failure.

-- Anonymous, January 24, 1999


Big Dog, my hat's off to you. Your experience and depth of understanding are manifest in your post. Thanks for adding *much* to my post.

-- Anonymous, January 24, 1999

Rick, with regard to your above prediction, I think you're closer to the roots than to the leaves. I know public relations. I know how it works. With rare exception--when a company has an obvious problem that affects the public--the blame is shifted. Who would ever want to publicly blame a now-dead or retired, revered CEO for short-sighted thinking?

Again--as Deming TRIED to teach the world--it's all about systems. When you engineer quality into the system, you produce a quality product. When you rely only on testing--after the product is produced--you are hurting your organization.

I blather on. ('Time to talk to my doctor about getting my first prescription for Valium. While they can still make it. Hee hee. Not.)

Have a great week.

MB

-- Anonymous, January 25, 1999


Rick --- is there any way, given your positioning within the industry, to raise the alarm on these specific issues to a qualitatively different level?

Meaning, as the calendar moves on and the dynamics of Y2K shift, refocusing the debate on the consequences of rushing or ignoring acceptance testing, rather than on a few more-or-less percent of "compliance" reached in the remediation phase?

You guys know this is far from an academic issue. RD's point about a frozen-code date is appallingly germane to reducing (perhaps) the Y2K impact post-1/1/2000. And, pessimistic though I am about industry and organizational dynamics, it isn't 10/31/99 *yet*. I'm oversimplifying, but on 1/25/99, the question for utilities (all sectors, really) shouldn't be:

.... how much code is compliant in percentage terms (and don't get me going on that)? but

.... when in 1999 are you going to freeze "Remediation/Testing" and perform "Acceptance Testing" on what you've done?

Then:

If the given system (or subsystem) collapses completely when tested on 10/31/99, the remaining 60 days are needed for intense contingency planning.

If the system executes on 10/31/99 with 'x' faults, the 60 days are needed for high-priority fix-and-retest and contingency planning.

If the system executes on 10/31/99 with non-fatal glitches, you're aces. Move on to the other systems that collapsed or had the 'x' faults.

I would personally argue for 9/30/99 (Ha! 1/1/99) but why cry? 10/31/99 probably is the soonest possible date that responsible organizations can bite down hard and face the (immediate) future.

Rick, you and Roleigh Martin are waist-deep in the stuff and among the few Y2K alarmists who command attention in the industry. Only you can decide when or whether it's necessary to "go ballistic" to get above the current noise level. I don't honestly know from inside the real scope of the risks to the industry or the country .... maybe even you don't.

I do know from my own consulting experience that one of the big rules for remaining in the club with the big boys/girls is that you agree jointly to *never* flat-out say the emperor has no clothes. If this emperor has clothes on, however ragged and torn, fine. But if he/she is naked, would you please say so in the plainest possible terms?

Or, more pertinently, if the "clothes" need to be picked up from the thrift store between 10/31/99 and 1/1/2000, would you tell the emperor to stop all other nakedness-covering work by 10/31/99?

-- Anonymous, January 25, 1999

