IEEE Y2K Chair: The Fat Lady Has Yet To Sing
greenspun.com : LUSENET : TimeBomb 2000 (Y2000)
IEEE Y2K Chair's Pre- and Post-Rollover Predictions
I am the Chair of the Year 2000 Technical Information Focus Group, Technical Activities Board, The Institute of Electrical and Electronics Engineers (IEEE). I first came onto this forum with a critique of Ed Yourdon's implied focus on the rollover as about the most important event of Y2K ("Y2K End Game"). This set off some questions and discussion. (See http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=001jnW) I wrote that piece, which was first published on Roleigh Martin's website, on my own, without IEEE involvement.
Prior to that, it was my IEEE committee (I was the principal author) that wrote the pivotal letter to Congress that broke the logjam on liability legislation and got the bill to the President's desk. (Interestingly, his opposition to the bill evaporated at the same time and he signed the bill straight away.) This letter was also discussed on this forum. (See http://hv.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000zgp) This document did go through an IEEE vetting process.
Now I am posting here, as an individual again, my last comments/predictions of 1999 and my first response to the rollover (non)-events. (These were also first sent to Roleigh, who has published several of my writings on his site.) These may seem unusual but, I think, interesting.
From the very first, my committee and I were loath to put out anything that was not backed up by rigorous numerical analysis; that is our nature as engineers. But waiting for that to come in, without a huge budget and very powerful friends, would have left us out of the picture. We did the best thing we could in the circumstances: we analyzed the various dimensions of the problem (there are many and they all interrelate, as dimensions do) and tried to distinguish probabilities from possibilities, along with the reasoning needed to do so. From the very beginning, this put us in the reasoned middle camp between the Doomers and the Pollys, believing they were both wrong.
The Doomers have been mostly vanquished, as their pessimism centered on the rollover, embedded chips/systems and the power, phone, water, etc. infrastructure. They did not appreciate the degree of engineering that went into those utility systems, what it actually takes to produce a Y2K error, or the lack of real year sensitivity. Hidden chips and 00-years did not necessarily mean hidden problems, nor did problems mean failures. The Pollys have had a turn to dance. But that is probably premature. They do not appreciate the size and complexity of the business software infrastructure that must now stand the test. This test will take months to run before we have a clear idea of what damage, if any, was done. It will be a cumulative process -- no single failure is likely to be large enough to be immediately visible to the general public (there are few unguarded single-point failure sites) -- but rather a rash of small errors and failures, mostly contained within the suffering organization(s), with some bubbling out into the wider world to be absorbed (or not) there. Not very telegenic, but that is the reality.
If you would like to understand the rationale for my assessment of the challenge ahead and the one behind, please read the following two relatively short pieces.
Thank you.
*****************
Subject: IEEE Y2K Chair's Last Word Before Rollover
Date: Fri, 31 Dec 1999 02:37:44 -0800
From: "Dale W. Way"
To: Roleigh Martin
Roleigh,
You have been kind to me in publishing my work and I want you to know I appreciate that. Here is my last word before the rollover about what to worry about and what not to worry about, but not my last word overall. Rather than get philosophical at this point, I prefer to address things in terms of their intrinsic technical vulnerability and leave the softer analysis for later. Please feel free to put this on your list server. Thank you.
1. PHYSICAL CONTROL SYSTEMS, those that control physical things and processes, such as power generation and distribution, water treatment and distribution, phones, airplanes, elevators, traffic lights, etc., are in very good shape for rollover and beyond.
1.1. Errors based on the zero problem (software getting a zero in a field/variable, like a year, that the programmer never expected to be zero -- zero is an unusual number for both mathematics and computer science, and if a positive integer is expected and a zero shows up, there could be an error) will show up quickly and will generally be easy to locate and fix if they do occur.
1.2. The more systemic errors, those that involve computation on multiple dates that span the century boundary where the math doesn't work, will be a risk, but that vulnerability will last only for the few seconds or minutes before all dates are safely on the other side of the boundary. These systems are predominantly real-time, and their view of time (even with dates attached, going along for the ride) is very short; the width of the time window they process is very narrow. (An example of this is the problem announced as fixed today in FAA ATC computers: if the primary died and the backup kicked in within a ten-second window around rollover, the backup would use the data from the 1999 file, ten seconds out of date, so aircraft would appear to be in the positions they were actually in ten seconds earlier. But after ten seconds, all would be well. This was not a big risk, but the FAA fixed it anyway. See http://dailynews.yahoo.com/h/nm/19991231/pl/yk_airlines_2.html) There is inherently less year sensitivity in the physical control system infrastructure.
1.3. Physical control systems have been generally easier and safer to remediate. For a number of technical and managerial reasons, physical control systems are better engineered and stress tested on a regular basis to force errors and firmly establish operational limits. They are therefore better understood, making them more accepting of effective remediation: vulnerabilities are easier to identify, locate and eliminate, and the chances of those modifications causing inadvertent errors elsewhere in the system, or in systems dependent on that system, are much lower than for more informational (data processing) functions and systems.
1.4. Physical control systems have relatively less software, and there is more convergence between hardware, system software, application software -- often all special purpose for the express task they do -- and the task itself than in many informational (data processing) functions and systems.
1.5. Even if any of the above kinds of errors do occur, physical control systems often have redundancy and other cushioning functions engineered into them, which do not exist in more informational (data processing) functions and systems, to keep the overall system functioning.
1.6. Even if errors of the above kinds do occur and break through redundancy containment, there is a very high likelihood that they can be identified, located and fixed quickly and safely.
1.7. As the systems that produce the goods and services that define the organizations that own them, physical control systems are revenue producers, and therefore have always had management attention and investment.
2. PRIMARY PRODUCTION SYSTEMS OF AN ON-LINE TRANSACTION PROCESSING (OLTP) NATURE in the information processing area, as in banks (ATM and check processing functions), other financial enterprises and government, are at more risk than physical control systems, but still in fairly good shape around rollover and beyond.
2.1. Primary OLTP production systems share many of the intrinsic invulnerabilities of physical control systems: being close to real-time means narrow date processing windows, which means narrow intrinsic vulnerability; being highly focused and tuned means more convergence between hardware, system software, application software and task; and being better understood means a greater acceptance of effective remediation and quicker and safer repair of errors that do occur. Yesterday's report on the failure of 20,000 credit card terminals is an example of an inconvenient, minor economic-cost failure in these systems. In that case, forward-looking date processing, in which the leading edge of the date range crossed the century boundary, caused a problem. The "fix" was to wait three days, until all the dates were on the new side of the boundary. See http://www.salon.com/news/wire/1999/12/30/y2k_london/index.html
2.2. But primary production systems in the information processing area are not engineered with redundancy and stress tested on a regular basis to force errors and firmly establish operational limits, as in physical control systems. Failures there will be harder to contain.
2.3. Organizations that have primary production systems in the information processing area often have more kinds of production to do (not just OLTP) than those in the physical control area. That means more kinds of heterogeneous technologies (architectures, platforms, languages, database structures, etc.) of more diverse vintages/ages to oversee, and those diverse systems are more often interconnected and interdependent with each other via shared data sources, making such systems more resistant to safe and effective remediation. Errors that do occur will be more difficult to track down and fix quickly and safely.
2.4. But as the systems that produce the goods and services that define the organizations that own them, primary production systems in the information processing area are also revenue producers, and therefore have always had management attention and investment.
3. SUPPORT SYSTEMS, those that monitor and detect faults, schedule maintenance, order spare parts, etc., and to some extent manage the primary production systems for efficiency at either the physical control or the informational end, are at somewhat greater risk than the primary production systems themselves.
3.1. Support systems have wider date ranges, and therefore wider windows of vulnerability (often weeks or months), than the more real-time, narrow-windowed production systems.
3.2. Support systems are not as well engineered as primary production systems, are not stress tested, and have limited redundancy. Failures there will be harder to contain.
3.3. But support systems are not generally overly complex; they are generally well understood, making them accepting of effective and safe remediation, and errors that do occur are relatively quick and safe to fix.
3.4. Support systems act as an immune system to their primary production systems. As such, Y2K is like HIV to them and the production systems; if frequent or persistent failures occur here, over months, they will eventually degrade the reliability and performance of primary production systems.
3.5. Support systems directly support the primary production systems and as such are seen by management as essential to revenue production. Therefore they have generally gotten management attention and investment.
4. ADMINISTRATIVE AND ACCOUNTING (A&A) SYSTEMS, those that support the general economic functions of the organization (purchasing, order processing, invoicing, accounting, personnel, payroll, tax reporting, etc.) are at very great risk and will be for a long time. It is in this area that severe, widespread and persistent failures are most likely to come, if they do at all. These systems present virtually the worst-case scenario for all risk factors of Y2K. If any systems lock up in a snarl of interlocking errors and destroyed or corrupted data in an unrecoverable state, it is likely to be these systems.
4.1. A&A systems are almost all software and, after years or decades of extensions and modifications, are extremely complex, making them highly resistant to safe and effective remediation and making any errors that do occur difficult to detect, locate and fix safely and effectively.
4.2. A&A systems have very wide date ranges and therefore windows of vulnerability (often weeks or months).
4.3. A&A systems are very large, highly interconnected with many shared data sources, often composed of heterogeneous technologies (architectures, platforms, languages, database structures, etc.) of more diverse vintages/ages, and relatively poorly understood. This also makes them highly resistant to safe and effective remediation and makes any errors that do occur difficult to detect, locate and fix safely and effectively.
4.4. A&A systems are relatively poorly maintained in terms of good engineering practices, stress testing, etc. and have little or no redundancy built in. Failures there will be harder to contain.
4.5. A&A systems have always been viewed by management exclusively as a cost center and have received relatively little management attention and investment. Cutting costs has more often been the goal, and this has led to the ghettoization of many such systems, especially those built on mainframe technology. Yet the data they produce is extracted and used by other applications all over the enterprise.
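The date-arithmetic failure modes behind items 1.1, 1.2 and 2.1 can be illustrated in a few lines of Python. This is a minimal sketch, not code from any system mentioned in this letter: the function names, the card-expiry scenario and the pivot year of 50 are hypothetical assumptions, and it assumes years stored as two digits with the century implied as 19xx. It also shows "windowing", one common remediation technique, for contrast.

```python
"""Sketch of two-digit-year (YY) failure modes; all names are hypothetical."""


def elapsed_years_2digit(start_yy: int, end_yy: int) -> int:
    # Interval arithmetic on bare two-digit years (cf. item 1.2):
    # an interval that spans the century boundary comes out negative.
    return end_yy - start_yy


def is_valid_year_2digit(yy: int) -> bool:
    # A validity check written when 00 never occurred as a real year and was
    # used as a "no date entered" sentinel (cf. item 1.1): 2000 is rejected.
    return 0 < yy <= 99


def is_expired_2digit(exp_yy: int, exp_mm: int, now_yy: int, now_mm: int) -> bool:
    # Forward-looking comparison (cf. item 2.1): an expiry date just past the
    # century boundary compares as if it were decades in the past.
    return (exp_yy, exp_mm) < (now_yy, now_mm)


def expand_year(yy: int, pivot: int = 50) -> int:
    # "Windowing" remediation: YY below the assumed pivot is read as 20YY,
    # anything else as 19YY.
    return 2000 + yy if yy < pivot else 1900 + yy


def elapsed_years_windowed(start_yy: int, end_yy: int) -> int:
    # Same interval as above, computed on expanded four-digit years.
    return expand_year(end_yy) - expand_year(start_yy)


if __name__ == "__main__":
    # A record opened in 1998 (98) and processed in 2000 (00).
    print(elapsed_years_2digit(98, 0))    # -98
    print(elapsed_years_windowed(98, 0))  # 2
    # The year 2000, entered as 00, fails a check that assumed a positive year.
    print(is_valid_year_2digit(0))        # False
    # A card expiring January 2000, checked in December 1999.
    print(is_expired_2digit(0, 1, 99, 12))  # True (wrong)
    print(is_expired_2digit(expand_year(0), 1, expand_year(99), 12))  # False
```

Windowing only moves the problem: with a pivot of 50, a two-digit 49 becomes 2049 and 50 becomes 1950, so any real date outside the chosen 100-year window is still misread.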
In summary, physical control systems and other primary production systems are not at much risk and not for long, while administrative and accounting systems in all industries are at great risk for a long time. Support systems are in the middle, but closer to production systems at the lower risk end. If we are to have major damage to the economy and system infrastructure it will come from administrative and accounting systems and those dependent on them. This will have long-term impacts on the economy, but the shape of those is hard to decipher at this point. There is still much we can do by way of adaptation to mitigate such failures, but if they persist anyway, we are in serious trouble. Clear heads and calm hearts are called for in any case. We will get through this.
Thank you for your attention.
Dale W. Way Chairman Year 2000 Technical Information Focus Group Technical Activities Board The Institute of Electrical and Electronics Engineers (IEEE)
*****************
Subject: IEEE Y2K Chair: The Fat Lady Has Not Yet Sung
Date: Sun, 02 Jan 2000 19:10:19 -0800
From: "Dale W. Way"
To: Roleigh Martin
Roleigh,
You know from my writings over the years that I believed the rollover was being overblown as the most significant aspect of Y2K, and was even being made synonymous with it by many. (Recall my critique of Ed Yourdon's essay "Y2K End Game", which you kindly published on your site.) The rolloveritis that gripped many is understandable, in that nothing works like a deadline in getting people's attention and action. The early alarm-sounders of Y2K used this, but did not let go of it when appropriate because they were too often in transmit mode and not enough in receive mode. And when they were in receive mode, they listened to themselves too much, with "did you hear!" rumors ricocheting all around the world. The media also went strongly with this dramatic focus for obvious reasons.
Then other people with no pretense to computer knowledge entered the game, often with other agendas lurking beneath the surface. They built platforms to stand on and decry all the horrible things that were POSSIBLE with computer chips being embedded everywhere. In the absence of real understanding of the multi-dimensional details (technological, historical, cultural and managerial) of the world of the physical control infrastructure, these misguided people did not bother to study and learn that it takes more than a computer chip (whatever that is) and a 00-year to make a Y2K error, let alone to allow that error to build into a failure, let alone to let that failure emerge into public view. The details have always shown that the PROBABLE in that world was much smaller and less threatening. This distinction was, to some extent by choice, beyond their understanding.
The root of the word "science" is skei-: to cut, split, to separate one thing from another, to discern (American Heritage Dictionary, 3rd Edition). The above people, whom I am painting with a broad brush that is not justified in every case, failed to discern the various dimensions of Y2K, one from another, and how those dimensions would combine into a more predictable reality. They too often lumped Y2K into one thing and attached their own hopes and fears to that monolithic notion. But there is another world that Y2K threatens that has gone undifferentiated by these people and much of the media as well. The Feds and other powers that be have not been so simple-minded. That is why you are seeing warnings that we are not out of the woods yet, that problems could still emerge in the near future. The reason is simple:
We are leaving the world of LOW intrinsic vulnerability and HIGH remediation capability: the physical control system infrastructure -- for one of HIGH intrinsic vulnerability and LOW remediation capability: the data and information processing infrastructure.
We are going from a world of engineering based on scientific principles to one of art or craft based on ever-changing business fashions (along with more stable, but still slowly changing, accounting and other regulatory principles). We are going from one of lean, special-purpose technology tied tightly to the task it is to do -- to one of over-featured general-purpose technology adequate to many tasks it could be asked to do, but not particularly good at any one. We are going from a world of moderate size and limited interactive complexity -- to one of immense size and great interactive complexity. We are going from a world where most systems are very well understood -- to one where holistic understanding dims after one or two steps away from the element being examined at any one time, where a correct fix here can bite over there, because the interdependencies were not understood and the testing infrastructure and time were inadequate not only to catch it and track it down, but to devise and retest a completely safe fix that would not bite somewhere else.
I do not know what is going to happen. There is still much human organizations can do to mitigate and ameliorate errors and failures that do occur. Much will be held within organizations and not spill out into any kind of public view. Such things are not necessarily reportable in the crisis-management sense. But to the extent they occur, they will likely accumulate and build up. If our collective ability to overcome them falls behind their rate of occurrence, there will be disruption and damage, ranging from processing slowdowns, to data loss or corruption, to serious system lockups. It will take a few months to really know.
The fat lady has not even walked on the stage yet.
Thank you for listening and posting this to your list.
Dale W. Way Chairman Year 2000 Technical Information Focus Group Technical Activities Board The Institute of Electrical and Electronics Engineers (IEEE)
-- Dale W. Way (d.way@ieee.org), January 03, 2000
Thank YOU Mr. Way. Thank you very, very much!!!
Mike
=====================================================================
-- Mike Taylor (mtdesign3@aol.com), January 03, 2000.
Precisely. I went nuts trying to find my recent thread on the schedule for analyzing Y2K impacts. Can anyone help with a link? It would fit here.
-- BigDog (BigDog@duffer.com), January 03, 2000.
Thanks again. Your presentation is helpful, and your perspective is unique. My only contention: many of us tried very hard to understand the embedded chip problem, to no avail (in retrospect). Our failure was not from lack of effort. Your Q&A on this forum some time back was enlightening and, at times, a little surprising. So far, this thing has played out according to your playbook.
-- Dave (aaa@aaa.com), January 03, 2000.
to the top!
-- none@nope.naw (none@nope.naw), January 03, 2000.
Mr. Way,
Thank you for taking the time to post on the forum. From your comments, it would appear to me that the Y2K problem remains a management problem.
And thank you for this statement
"THE LIGHTS WILL NOT GO OUT AT MIDNIGHT." From the thread "IEEE Y2K Committee Chairman takes questions".
The Watershed Significance of the IEEE Letter to Congress
IEEE Y2K Chairman's Personal, Pessimistic Take on Y2K and Yourdon's End Game Paper
Subject: IEEE Y2K Chair Message to EY (Mr. Ways essay on Roleigh's site)
IEEE Y2K Chairman takes questions
Very serious (not light reading) Y2K assessment from the IEEE (PDF File)
IEEE TAB YEAR 2000 TECHNICAL INFORMATION FOCUS GROUP
Comments from Cory Hamasaki: Enterprise systems: the real Y2K issue
PS Group, I have been enjoying real life offline. Welcome to the Year 2000! It was a safe rollover; now we watch for the thousand cuts.