Forecasting Year 2000 Disruption


Been away from the Forum for a couple of weeks, so this paper from the UK Taskforce may already have been posted.

If you haven't seen it, it is well worth reading, as you can catch the drift clearly between the technical bits. You will need Adobe Acrobat to read it, but this can be downloaded from the report site.

http://www.taskforce2000.co.uk/pdf/failure.htm

-- Chris (griffen@globalnet.co.uk), December 04, 1999

Answers

PDF file: Failure.pdf

-- spider (spider0@usa.net), December 04, 1999.

For those of us who have WEB-TV or otherwise cannot read PDF formatted files, can someone please summarize?

-- King of Spain (madrid@aol.cum), December 04, 1999.

KOS, I downloaded the papers and would be happy to send them to you in text form if you want, hardcopy or email.

-- Carol (glear@usa.net), December 04, 1999.

This should do it - the summary from the full PDF document at http://www.taskforce2000.co.uk/pdf/failure.pdf.

_________________________________________________________________

Predicting Year 2000 Disruption

Key Findings

- An assessment of Year 2000 computer problem outcomes is necessary to enable effective contingency planning. No-one knows what the impact of the problem will be. But it is necessary to make an assessment of outcomes so that organisations can implement effective contingency planning.

- The current focus on 1st Jan 2000, although understandable, is simplistic and unrealistic.

- A Failure Curve can be constructed of computer system failures and disruption occurring over time. Problems have occurred already, such as credit card expiry dates and pension maturity dates. Some are happening now. There will be a series of failures occurring over time, resulting in a Failure Curve. The key questions are what is the shape of the curve, and when do system failures translate into noticeable disruption.

- Only rollover errors will occur at midnight 31st Dec 1999. Many other errors, such as implementation problems, will occur earlier or later.

- Organisations have less time than many think to complete their planning. They will begin to experience problems before the end of 1999 and need therefore to have finished their programme and have contingency plans in place well before then.

- There is a drag factor between computer errors occurring and disruption becoming noticeable. Some errors may not be detected for some time, until the process that has failed is needed or is used. It will also be possible to deal internally with some errors, so that disruption may not occur for some time. Problems could thus continue well into 2000.

- It will become increasingly difficult for organisations to prevent disruption by managing internal failures. The Gartner Group has estimated that 60% of failures will occur in 1999. Action 2000 has said that 1 in 10 organisations have suffered problems already. Yet there has so far been little noticeable disruption, since each failure has been manageable. This will become increasingly difficult as the rate of occurrence increases.

- Failures will increase throughout 1999, likely to result in noticeable disruption later in 1999 and steadily increasing into 2000. As we approach the end of the year, more business processes will begin to deal with year 2000 dates and may fail. Implementation failures will also occur as new products and upgraded systems are brought into use. The occurrence of failures is therefore likely to increase as we get closer to 2000. The drag factor, however, will mean disruption will begin later and steadily increase into 2000 as more and more failures are encountered.

- Death by a Thousand Cuts is more of a threat than one single catastrophic failure. It will be this build-up of problems from multiple failures, rather than a single catastrophic event, that poses the greatest threat.

_______________________________

 



-- themom (themom@canoemail.com), December 04, 1999.

I found the section on EMBEDDED SYSTEMS FAILURES to be rather enlightening. (My wife cried when I read it to her.)

"RTEL (Real-Time Engineering LTD), a company specialising in this area, has calculated a significant statistical probability that around 20 incidents of the type experienced in Bhopal in India a few years ago will occur somewhere around the world."

The USA is in for more than a BITR if 20 Bhopals happen around the world on 01/01/2000... and we eke by unscathed. Guess who will be blamed.

-- GoldReal (GoldReal@aol.com), December 04, 1999.



Thanks for the summary & hot link! I found the cumulative effect of the non-hysterical analytical approach to be quite sobering.

-- Chris (griffen@globalnet.co.uk), December 04, 1999.

I pasted the entire report. Hope this is helpful for those of you who don't have Adobe. For educational research purposes:

What will happen when the clocks roll over from 1999 to 2000? Perhaps nothing very dramatic. If so, would that mean that the year 2000 computer problem (the dreaded Millennium Bug) is a lot of nonsense? Unfortunately not. Few people understand that the problem is much more about the way computers process dates than it is about clocks changing from this to the next century. For example, if an administrative routine has to calculate someone's age next year, the computer compares a date in 2000 with that person's birthday - obviously in this century. It is highly unlikely that such a calculation would need to be done at midnight on December 31. It might be done at any time - this year or next. Either way, if the computer cannot recognise 2000, the answer is likely to be wrong. Millions of date processes like this are happening every minute of every day. And, as 1999 progresses, they are increasingly likely to involve dates in the next century.

We need to know when problems might occur and what might be the consequences. These are matters that Ian Hugo studies in this paper. But he has gone further and examined the possible effect of other issues. For example, there have already been cases of attempts to fix the problem introducing new errors. Can we expect to see more of these? And what about the knock-on effects of one problem leading to another? And what might be the delay between a technical problem occurring and its practical impact on a business or individual? How about the consequence of congestion - with multiple difficulties happening at around the same time?

He looks at the possibility that, if insufficient systems are fixed, we might experience, not an apocalyptic meltdown, but a gradual accumulation of relatively minor nuisances. Leading, possibly, to death by a thousand cuts - occurring in the early months of next year.

No one knows what will happen. But businesses and individuals have to make hard decisions and to plan accordingly. They may have less time than they think.

Robin Guenier - Executive Director, Taskforce 2000
Dr. John Perkins - Director, NCC
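The foreword's age example is easy to reproduce. Below is a minimal sketch (the function name and the two-digit storage convention are illustrative assumptions, not taken from the report) of how a routine that keeps only two-digit years goes wrong as soon as one of the dates falls in 2000, regardless of what the clock says at midnight on December 31:

def age_next_year(birth_yy: int, next_yy: int) -> int:
    """Age calculation as an old two-digit-year routine might do it.
    birth_yy and next_yy are two-digit years (65 for 1965, 0 for 2000);
    with only two digits the century is lost."""
    return next_yy - birth_yy

# Born 1965; "next year" is 1999, stored as 99: 99 - 65 = 34, correct.
print(age_next_year(65, 99))
# Born 1965; "next year" is 2000, stored as 00: 0 - 65 = -65, nonsense.
print(age_next_year(65, 0))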


Introduction

As industry has tracked through the general Year 2000 process, from disbelief/anger to acceptance, inventory discovery, remediation, etc, so the spotlight has moved over time to arrive now at assessment of external dependencies, predicted outcomes and contingency planning. Judgements on the second of these items will influence the third, and the issue of predicting outcomes is the focus of this article.

The motive for trying to understand outcomes, globally, is not idle speculation. A better understanding of likely outcomes enables more refined contingency planning. Contingency plans can entail very significant cost, which is generally operational cost and thus comes directly off the bottom line; so it is not just a question of ensuring adequate contingency measures but also of avoiding excessive and unnecessary ones. There is also the question of the timing of contingency plans and millennium operating regimes, to which we will return.

The popular focus is on what will happen on the 1st of January 2000 (other than celebrations). Since the problem centres on century date change, this focus is natural. However, it is also, in our view, simplistic and unrealistic. The only simple answer to what will happen is 42 and, once again, mice (albeit of a different kind) are involved.

The approach to prediction that we take here is an analysis that attempts to identify the principal factors at play, which we think could be used as input to contingency planning or, perhaps more ambitiously, as the basis for a putative predictive model. Following the reasoning alone may identify aspects of the problem not previously considered.

A Failure Curve

If date logic failures occur on the 1st of January 2000, the only type that necessarily occur then are rollover errors. Every other kind of date logic error is more likely to occur before or after that date. Even rollover errors occurring on that date may not be detected for some time afterwards. So what we should expect is some extended period over which failures will occur. In fact, if we think for a moment, we already know that this is the case, because errors of one kind or another have been reported for several years. The earliest of them occurred typically in life assurance and savings and loan applications, but there have been others, occurring where business or administrative cycles extend over several years. We also know, as in the case for instance of New Zealand Aluminium Smelters, that logic errors may not be detected for up to a year (theoretically longer) after they occur.

So, what we are facing is a failure curve, extending over a period of years, not a sudden doomsday. That leads to the obvious question: what is the shape of the failure curve? Trying to plot the shape of the failure curve is extremely difficult and involves making a number of very broad assumptions, particularly as most of the data required is, for all practical purposes, unknowable. If our reasoning is more or less correct, however, there is an inevitable and important conclusion: it is that you need your contingency plans drawn up, tested and practised sooner than you probably think.

Types Of Failure

One of the many complicating factors in the Year 2000 problem is that the process of remediation is almost as fraught with the potential for failure as date logic itself. To understand the likely consequences of the whole problem we therefore have to take into account all types of failure, not just those occurring through date errors. We would argue that the point of primary interest is whether systems/applications will continue to function as previously, rather than why they do or do not. Therefore, whether failures result from date logic errors or some other cause attributable to the remediation process is in the first instance immaterial. So in examining whether systems are likely to fail or not, we have included all likely causes of failure within our assessment. The types of error we will consider are date-logic errors, new errors introduced through remediation, configuration management errors and implementation overruns.

Date Logic Errors

We've divided date logic errors into two broad categories: rollover, leap-year and special meaning date errors; and other logic errors. The distinguishing characteristic is that rollover, leap-year and special meaning errors necessarily occur precisely at the change of century, on the 29th February or when special meaning dates are encountered respectively, whilst other date-logic errors occur at less easily identifiable times, depending on various factors which we will try to identify as we go along. Both types of date-logic error may be identified when they occur or later. There is no more to be said about rollover errors at this stage, but we will briefly discuss other types of date logic error. The other category divides more or less neatly into two:

- Errors that cause the system/application to reject data or transactions

- Errors that corrupt data but are discoverable only when detected by some extraneous process (e.g. customer query).

The distinguishing characteristic between these two categories is that the former type of error causes the process to halt (and thus the error to be detected) when the first invalid transaction occurs, whereas detection of the latter type of error is largely a matter of chance.

New Errors Introduced

There is some evidence, primarily from the USA, that the largest number of errors detected by Year 2000 testing are not date logic errors but errors of other types. These may be errors already known but disregarded previously because of their minor significance, or new errors introduced as a result of remediation measures. The number is difficult to quantify in any particular case because it depends on the state of the code when remediation started and the skill of the programmers in remediation. However, the possibility of failure for this reason is significant, and authorities on code development/maintenance, such as Capers Jones in the USA and Durham Systems Management in the UK, have knowledge of the probabilities that can be applied.
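To make the rollover and remediation points concrete, here is a small illustrative sketch. Neither the expiry check nor the windowing fix comes from the report; windowing with a pivot year is simply one commonly used remediation technique, shown because the fix itself can introduce a new error of exactly the kind described above:

def expired_two_digit(expiry_yy: int, today_yy: int) -> bool:
    """Naive rollover-prone check: a card expiring in '02' (2002) looks
    expired once 'today' is '99' (1999), because 2 < 99."""
    return expiry_yy < today_yy

def to_four_digit_windowed(yy: int, pivot: int = 30) -> int:
    """Windowing remediation: two-digit years below the pivot are read
    as 20xx, the rest as 19xx. Genuine dates outside the window are
    then misread (e.g. a 1925 birth year becomes 2025)."""
    return 2000 + yy if yy < pivot else 1900 + yy

print(expired_two_digit(2, 99))    # True  -- card valid to 2002 wrongly rejected in 1999
print(to_four_digit_windowed(2))   # 2002  -- windowing fixes the expiry date...
print(to_four_digit_windowed(25))  # 2025  -- ...but introduces a new error elsewhere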

Configuration Management Errors

There is similar evidence from testing that configuration management errors constitute a large number of the errors detected. These can be of two types: firstly, those arising through wrong versions/releases of application code receiving attention and, secondly, those arising because wrong versions/releases of utilities or other infrastructure software are included in overall change packages. Given the number of re-installs, upgrades and replacements scheduled in, in most cases, a short space of time, it is only to be expected that a number of errors of this type will occur. Evidence from tests already conducted supports this assertion, but we know of no research that could place a probability factor on it.

Implementation Overrun

This is more a failure than a type of error, but nonetheless important; it may, in fact, turn out to be the most significant type of failure of all in Year 2000 scenarios. A significant part of most Year 2000 programmes is a strategy to upgrade or replace rather than correct some systems/applications. Wise virgins will have included such projects within their Year 2000 programmes but many, to our certain knowledge, have not. Whilst the process of upgrading or replacement of packages or platforms is well understood at the IT level, the same is not true at Board of Director level, which allows misunderstanding of time and risk factors between IT and the Board. And the historical record of delivery of such projects on time is ominous. This is particularly so in the case of new development, be it bespoke or of new packages, but any kind of large project poses a time risk.

To this we would add a point on implementation or rollout. For the much maligned mainframe-terminal model, this is a routine and simple exercise, involving little more than switching software from development/test to production directories. It is rather different in the large client-server environment: an implementation time of hours or days in the mainframe-terminal environment can easily extend to weeks or months in an equivalent client-server environment. There is little evidence that this point is understood at Board level in many organisations.

The Social Dimension: Dysfunctional Behaviour

There is another kind of error which we must introduce into the picture but which has nothing to do with the technical nature of the Year 2000 problem: public reaction, and the effect that may have on outcomes. Some years ago, the UK experienced a shortage of sugar for no better reason than that there was a rumour there would be a shortage of sugar. Because of the rumour, a sudden and unforeseen panic demand for sugar resulted in a shortage of supply. In the event, there was no shortage other than that caused by the inability of the supply chain to deliver in the demanded time-scale. Thus a rumour became a self-fulfilling prophecy. The Year 2000 situation has a great deal of potential for similar situations to apply, not just for food supply but also, much more significantly, for money supply or other fundamentals that rely on public confidence for stability. Indeed, financial confidence (and consumer reaction) is currently one of the great unknowns, with huge potential for disruption. We mention this point only because the possibility of dysfunctional public behaviour has the potential to disrupt many otherwise unaffected business processes and thus needs to be factored into any predicted Year 2000 scenarios.

When Errors Will Occur

Having identified the principal types of error likely to apply to the Year 2000 situation, we now need to identify when they are likely to occur. There is less inevitable randomness in occurrence than may at first be suspected; however, from a global perspective, for all practical purposes the relevant data is so unlikely to be knowable that it may have to be treated as though it were random in some way. What we can do is to identify dates around which many date-logic or other errors/failures are likely to occur and regard these as likely error cluster points. So that is what we will attempt to do.

We have already pointed out that rollover errors occur necessarily at the change of century, so we can expect a cluster of errors then. We can in fact posit many other clusters of failures based on key dates identified for Year 2000 testing. These will (should) include all the standard dates that were identified in your test plans, such as the obvious 01.01.2000 and 29.02.2000, even if you don't eventually get around to using them all in tests. Another set of dates in the same category, around which clusters of errors may be expected, are the beginnings of business or administrative process cycles, as soon as the cycle extends into the next century. The earliest reported errors tend to confirm this. Cycles may be anything between weekly and annual and would normally be identified in test plans. The dates at which process cycles begin will differ from organisation to organisation so, for global estimation purposes, some very general assumptions need to be made. We could reasonably assume, for instance, that although financial years can begin on essentially any date, most in practice will align with calendar quarters. Additionally, we can note that the new financial year for most public sector organisations in the UK begins in April 1999.

There is one more assumption we would suggest: that more processes have short cycles than have long ones. This is arguable, but consider the following reasoning. It is easy to accept that more processes have a one-year cycle than a five-year cycle; whether more processes have a quarterly cycle than a semi-annual cycle (and more a semi-annual cycle than a one-year cycle) may be more difficult to argue. However, we would expect that where the event horizon for a process with a one-year cycle occurs before the application has been fixed, a first measure would be to artificially reduce the length of the cycle. For instance, some American States that issue driving licences with a five-year validity but had not fixed the relevant software in time simply reduced the validity period. Similarly, for instance, annual budget forecasting applications that would otherwise fail could, in the extreme, be modified to take account of an artificially truncated period of time.

Date-logic errors are not our only concern, as we have already pointed out. Errors introduced through remediation and configuration management errors should be detected when the system/application is tested. Failing that, configuration management errors are likely to show up when the system/application is (re)installed. Also, given the abysmal general record of IT in delivering large projects on time, we should expect that some number of major upgrade/replacement projects with Year 2000 implications will not be delivered on time. Indeed, if the historical track record is maintained, the majority will be late. It has been argued that Year 2000 projects are more like maintenance projects than new developments and that therefore the historical record of delivery is irrelevant. We do not agree. A major reason for late delivery of new developments relates to specification changes, which should not apply in the Year 2000 context; that is agreed. However, the degree of testing required and the scale of change overall in Year 2000 projects is quite different from that involved in new developments and could easily confirm the historical record, albeit for different reasons. Moreover, upgrade/replacement projects already in train, such as a major platform upgrade at a bank and a replacement switchboard at a hospital, have already resulted in press reports demonstrating their disruption potential.

The problem here is to try to generalise on when, globally, such projects are scheduled to be delivered. The following reasoning may help. Organisations leading the field have been delivering such projects over the past year, given their general target (albeit missed in most cases) of finishing internal work by the end of 1998. They should have fewer large projects left to deliver by the end of 1998 and so, even where overruns occur, they should be very few in number and, to that extent, more easily recoverable. For large organisations significantly behind the leading edge, the assumptions should probably be quite different. These face the same immutable deadlines and so will have had, of necessity, to shoe-horn schedules into the shorter time remaining. The obvious inference is that these organisations will have many large projects with delivery dates concentrated in a short period of time. If we focus on application event horizons rather than the populist deadline, that period is likely to be the first half of 1999; and, with most projects scheduled by necessity for the latest acceptable delivery date, the preponderance of scheduled delivery dates is likely to be in the second quarter. Finally, if dysfunctional behaviour by the public is to be brought into the picture, that will happen and the effects will be felt around the turn of the century.

When Errors Will Be Detected

The date at which errors occur is not necessarily, of course, the date on which they will be detected. We therefore need to introduce some drag factor between the dates on which errors may be expected to occur and the dates on which discovery of the errors may be anticipated. Once again, it may be helpful to unravel the possibilities in terms of types of error or failure. Most date logic errors in IT systems don't of themselves cause the system/application to fail; they merely corrupt data. Some organisations aware of this are planning calibration testing to try to detect data corruption if it occurs. Many factors contribute to exactly what happens, so it is perhaps worthwhile spending a little time examining what may occur, and why, from some examples.

One of the most clear-cut cases is where an application refuses to accept transactions that include 21st century dates. The error is immediately apparent. To this category of immediately detected errors we can add all instances of late delivery of projects. Clearly, the fact of the failure (if not the remedy) will be immediately apparent. If the application is not compliant but nonetheless accepts 21st century dates on input, the next question is whether any subsequent validation/audit logic will indicate a problem. For instance, in the putative "9999 = end of file" problem, the file run would terminate the first time that this value was encountered in the relevant record field. Assuming that this is before the end of the file, whether or not the fact that some records have not been processed is detected will depend on what other validation routines are applied to the run. Potentially, the failure could continue undetected for some time. Manual intercession (human common sense) may allow other types of date-logic error to be detected more or less immediately. For instance, visual inspection of spreadsheets may show some cells unexpectedly displayed with a series of slashes or coloured red. Similarly, in another instance of dates with special meanings, where 31.12.99 is assigned the meaning of sine die, the results of time having restarted may be immediately obvious. And, finally, customers may be expected to notice in some short space of time, if not immediately, if they are presented with inaccurate accounts or other output.

These points are difficult to summarise in terms of a general drag factor. However, it does seem likely that most errors will be detected within a short space of time after they occur, possibly a matter of weeks, if they are not detected immediately. The nightmares will be where data corruption is minor but cumulative, significant over time but not obvious at any point in time until the error becomes large enough to be obvious. Also, as the New Zealand Aluminium Smelters case illustrates, errors may lie unidentified for a long time if they occur at the beginning of a long process cycle but can be detected only by some end-of-process routine.

The Probability That Errors/Failures Will Occur

Yet another ingredient we must add is the probability that errors will occur. However horrendous the consequences of some potential failures, they may contribute little to eventual outcomes if the probability of their occurring is very small. By contrast, many failures of individually minor consequence but high probability, occurring more or less simultaneously, could materially affect outcomes. Some certainties and probabilities can be identified by continuing our type-of-error perspective, although we suggest that this may not be the best approach here. For instance, it can be asserted with certainty that any application with the potential to fail will do so if not fixed by the time of its event horizon. We can also look at the 29th of February 1996 for evidence of what may occur on the 29th of February 2000. The former date, unlike the latter, was incontrovertibly a leap day and nonetheless resulted in at least four headline failures. These were reported, in the UK media, as occurring at the Brussels Bourse, the Meteorological Office, New Zealand Aluminium Smelters and Papworth Hospital. It would be reasonable to assume that other unreported failures also occurred; and it would need to be a brave person who would predict fewer failures on 29.02.2000.

However, we would suggest that the way to approach assessment of probabilities is by organisation as much as by type of error. The distinction to be made amongst organisations is between those that, very broadly, manage IT well and started early on their Year 2000 programmes and those that do not and started late. Although these distinctions imply a 2 x 2 matrix, we suspect from experience that good IT management and timely attention to the Year 2000 issue are likely to coincide (and vice-versa). A large part of our reason for this assertion is that good IT management usually coincides with a high degree of Board-level involvement, which is also a pre-requisite for timely and well managed Year 2000 programmes.
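The 29.02.2000 point deserves a concrete illustration: 2000 is a leap year only because of the every-400-years exception, a rule that date logic frequently omits. The sketch below is ours rather than the report's, and simply shows the common mistake:

def is_leap_naive(year: int) -> bool:
    """Commonly implemented but incomplete rule: it misses the 400-year
    exception and so wrongly treats 2000 as a non-leap year."""
    return year % 4 == 0 and year % 100 != 0

def is_leap(year: int) -> bool:
    """The full Gregorian rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap_naive(1996), is_leap(1996))  # True True  -- 29.02.1996 handled either way
print(is_leap_naive(2000), is_leap(2000))  # False True -- the 29.02.2000 trap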

With that point in mind, we can assume that good organisations (in this context) will experience no errors of significant consequence unless they are unlucky. And luck will play a role. Some large financial organisations with whom we are acquainted have instituted a final quality assurance process after having remediated and tested code, before putting the code back into production. In some cases, they have discovered previously undetected dates and logic, albeit only a few amongst millions of lines of code. However, again in only very few cases, some of the errors found would have been show-stoppers had they remained undetected. The point is that even best practice, well managed, is not fool-proof and that some, albeit very few, failures should be anticipated from unexpected sources.

By contrast, large organisations late in starting and with less well managed IT will be much more susceptible to all the types of errors we have described. We would note particularly the following specific dangers. Firstly, there is the obvious possibility of inadequate identification of dates and date-logic through use of tools or methods whose limitations are not appreciated. This weakness will be compounded if, as is likely in such organisations, insufficient time and skills are applied to testing the results of remediation. Secondly, with respect to software packages and hardware platforms, there is the danger that claims of compliance have been misunderstood or inadequately tested. A full discussion of what compliance may or may not mean appears in Millennium Watch Volume 1 Issue 8, so we will not pursue the point here. Thirdly, control of the desktop is weak in most organisations, and organisations not adept at managing IT will face more serious risks where critical applications run, at least partly, on the desktop. Finally, the probability of failure to deliver upgrade/replacement projects on time, and the probability that many such projects will be scheduled for delivery in a short period of time, must be much greater in this second category of organisation. The two classes of organisation are effectively on diverging risk spirals.

We therefore think that the probability of errors/failures occurring is best dealt with by attributing probabilities to percentages of organisations in accordance with the above dichotomy. The lazy option would be to assume that 50% of organisations are in each category, although the historical record suggests that a 20:80 split (20% in the good category) might be more realistic.

The Number Of Errors

We need to add something on the number of errors likely to occur. We can find no useful way of distinguishing between different types of error here but would offer the following possibly helpful points. Industry averages suggest that the percentage of IT applications across the board that contain dates is 80% and that around half of these would cause significant failure if not corrected. If we assume a proportional relationship between the potential for failure and the number of failures occurring in practice, then 40 potential failures per hundred applications could be a figure to use. Of course, applications with the potential to fail significantly could (and probably do) have multiple points of potential failure within them, and that point could be factored in; or, at the global level of interest, it could be ignored.

With respect to embedded systems, the relevant figures that we have are that some 30% of control systems need investigation and around 1% fail in some sense, with a third to a half of that number failing significantly. These figures are drawn from information we have from the oil/gas and telecommunications industries. There is a slight variation in the information we have from the water industry, with marginally fewer control systems suspicious (nearer 25%) and rather more failures of all kinds (2-3%). For global estimation purposes, these differences are probably irrelevant. We have no specific information to offer on, for instance, continuous process manufacturing. The number of non-date-logic errors likely to occur has already been commented on above.

Embedded Systems Failures

Many of the above points apply to embedded as well as IT systems, but there are several reasons for regarding embedded systems as a special case. For instance, organisations involved in continuous process manufacturing face a far greater risk from embedded systems than organisations whose work is largely administrative. There are other pertinent distinctions. Incidence of date logic in plant control systems is much less frequent than in pure IT systems, and the incidence of potentially significant failures is also much lower. Also, although replacement projects for plant or equipment may be quite as costly and time-consuming as in IT (sometimes even more so), the risk of late delivery of a planned project is generally lower. Error occurrence and detection in embedded systems, similarly to that in IT systems, appears to have a marked dependence on process cycles, but particularly maintenance cycles in this case (although we are not aware of failures occurring at the beginning of maintenance cycles). However, health and safety issues predominate in the embedded systems world in a way that they do not in IT.

The most pertinent question is how to account for potential embedded systems failures in any view of outcomes. The much lower rate of significant failures relative to IT, but potentially increased significance when such failures occur, suggests introduction of a random element analogous to estimating (the reverse of) the chances of winning the national lottery. RTEL (Real-Time Engineering Ltd), a company specialising in this area, has calculated a significant statistical probability that around 20 incidents of the type experienced in Bhopal in India a few years ago will occur somewhere in the world. Their calculation is based on the number of chips in use, a conservative estimate of the number involved in critical processes and a generous estimate of the number that will be discovered and rectified. Where and when significant failures may occur are unknowns, however, and their prediction is just a statistical probability. Nonetheless, that approach, we suggest, is probably the best way to introduce embedded systems failures into the overall picture.
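The arithmetic behind the per-hundred-applications figure above, and the embedded-systems percentages, restates easily in code. The percentages are the report's; the population sizes and the translation into expected counts are our own illustrative assumptions:

# IT applications, using the report's industry averages.
apps = 100
with_dates = apps * 0.80                 # 80% of applications contain dates
potential_failures = with_dates * 0.50   # ~half of those would fail significantly
print(potential_failures)                # 40.0 potential failures per hundred applications

# Embedded/control systems: ~1% fail in some sense, a third to a half significantly.
control_systems = 10_000                 # illustrative population, not a report figure
fail_any = control_systems * 0.01
lo, hi = fail_any / 3, fail_any / 2
print(fail_any, round(lo, 1), hi)        # 100.0 33.3 50.0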

The Relationship Between Failures And Disruption

Having thrown all types of failure, not just date-logic errors, into the pot, we now have to put forward the proposition that no kind of error/failure is relevant for our purposes unless it has the potential to cause significant disruption. We do this with some diffidence, since a large accumulation of failures of non-disruptive proportions may have a discernible negative effect on a national economy. The OECD has predicted an overall negative impact of 0.5% of GDP on industrialised western economies. As an example, we would point to the fact that many failures are known to have occurred already but with little disruptive effect yet evident. They may have been disruptive internally within an organisation and incurred unwanted cost, but they have not been notably disruptive externally. Our focus here is on disruption that is visible externally.

So, how do we approach prediction of externally visible disruptive effects? Because of our global perspective, the first question we have tried to answer is whether we can assume some proportional relationship between the number of failures to be expected and the amount of disruption. This assumption is obviously untenable in any specific case but may be legitimate globally. Keys to understanding disruptive potential must be not just the impact of the failure but also the time to recovery. Let us start by examining some examples.

If a desktop spreadsheet application fails, the impact is more likely to be local than extensive and recovery time is likely to be short, probably a matter of hours. If a single undetected date in a major application causes a failure, disruption may be more pronounced but could still be containable within the organisation. Recovery time might reasonably be estimated at around three days: the time to take the application out of production, apply a fix, test the fix and put the application back into production. At the other end of the scale, we have some fairly obvious examples with major disruptive potential. One of these is the kind of application typified by air traffic control systems, which are programmed at a level intimate to the hardware on which they run. This means that they tend to run on old hardware and cannot easily be ported onto new compliant platforms. Replacement systems typically take years to develop and install, and any failure in this type of application will not only have significant potential to disrupt but could also entail a very long period to recovery. In between, we have potential failures in the implementation of replacement packages or systems and major upgrades. Here, the potential for disruption through late implementation is high and the recovery period, based on experience, is likely to be months (even years in some cases) rather than days.

We would add an observation on non-critical applications. It is common (and reasonable) practice to divide systems/applications into critical and non-critical categories, with the former obviously taking priority. However, this is a simplistic view of what is effectively a continuum of importance that runs through any inventory of systems/applications: the demarcation line between critical and non-critical is therefore to some extent arbitrary. Where Year 2000 programmes started in good time, the line should have been drawn so that all systems/applications with the potential for disruption fell into the critical category. In that case, non-critical systems/applications can be assumed to be non-disruptive, even if they fail.

However, it would be dangerous to make the same assumption where Year 2000 programmes started late. In some cases, particularly in the public sector, it is clear that the need for severe prioritisation is forcing the demarcation line to be drawn higher and higher, pushing highly important applications down into the non-critical category. Once in that category, these applications become more likely to fail, since they will receive lower priority for resources; and if they are highly important, they may well cause significant disruption.

Finally, we need to factor in the possibility that some critical facility, with local or systemic impact, will fail disastrously. Obvious possibilities are nuclear or chemical plant or some key central utility, such as the National Grid. Here the potential for disruption is large and obvious, even though the time to recovery may be relatively short and the probability of occurrence may be very low.

There is one further aspect of the relationship between failure and disruption we would just note here: a kind of congestion, which could occur in two ways. If multiple disruptive failures overlap in time in a single organisation, the amount of disruption will almost certainly exceed the sum of the disruptive effects of the individual failures. And the same could be true if multiple disruptive failures overlapped in time within a single supply chain or some other network of external dependencies. To allow for these cases, we need to introduce the possibility of some compound effect from disruptive failures, where they overlap in time.

To summarise, we think the wide variation in disruptive effects consequent on different types of failure means that a purely proportional relationship between numbers of failures and degree of disruption should be avoided if a better practical formula can be found. Therefore, for the moment, we choose just to note the above examples of relationships between failure and disruption and will now move on to summarise our arguments and try to suggest some practical way of forecasting outcomes.
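The ingredients assembled so far (clustered occurrence dates, a drag factor before detection, a recovery time per failure, and a compounding effect where disruptions overlap) can be wired together into a toy failure-curve simulation. The sketch below is entirely our own illustration of how such a model might be structured; every number in it is an assumed placeholder, not an estimate from the report:

import random

random.seed(1999)

def simulate(n_failures=40, horizon_weeks=104):
    """Toy failure curve: when failures occur, when they are noticed,
    how long recovery takes, and how overlapping disruption compounds."""
    active = [0] * horizon_weeks
    for _ in range(n_failures):
        occurs = random.randint(0, 77)              # occurrence spread over roughly the first 18 months
        drag = random.randint(0, 8)                 # weeks before the failure is noticed
        recovery = random.choice([1, 1, 2, 4, 12])  # most recover quickly, a few take months
        start = occurs + drag
        for week in range(start, min(start + recovery, horizon_weeks)):
            active[week] += 1
    # Congestion: overlapping disruptions are worse than the sum of their parts.
    return [k ** 1.5 for k in active]

curve = simulate()
peak_week = max(range(len(curve)), key=curve.__getitem__)
print(f"peak disruption in week {peak_week} of the two-year window")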

Summary Of Reasoning

In our reasoning, we have tried to isolate and discuss the most relevant elements of a potential predictive model: types of error/failure, when they may be expected to occur, when they will be detected, the number of errors occurring, the probability that they will occur and the relationship between failure and disruption. We think those are the key factors to be identified. We are conscious, however, of having introduced the need for either a lot of data that may be, for all practical purposes, unknowable, or a lot of guesswork; and it is reasonable to ask whether the amount of guesswork required can produce any useful results. What we need is some hard data and/or probabilities known to be high, and some way of simplifying the approach to the problem. The latter is key because, if the predictive model is to be useful, it needs to be created now. Time is, as ever with the Year 2000 issue, of the essence; and dealing with a lot of complexity takes time. As it happens, we think we have a useful answer, a kind of Occam's razor: it lies in the number of re-installs and upgrade and replacement projects scheduled for the first half of 1999 in large organisations late in starting their Year 2000 programmes.

Simplifying The Problem

Since the purpose of this exercise is to find some reasoned way of making useful predictions of Year 2000 outcomes, there is little point in focusing on unpredictable incidents. We have already suggested several types of potential error/failure in this category. If, on the other hand, we can identify some errors/failures with a high probability of occurrence and high disruption potential, these may be useful to set a baseline of probable disruption and may, in their cumulative effects, overshadow the disruptive effects of any other failures. As already indicated, we think a focus on re-installs and upgrade and replacement projects scheduled for the first half of 1999 provides the type of failure for which we are looking. Available industry averages can give us a fair grasp on probability of occurrence. Disruption potential can confidently be asserted in terms of weeks, if not months, and so disruption resulting from any single failure will be individually significant. Moreover, we can reasonably assert where such failures are likely to occur in a short period of time, creating congestion. Any large or IT-intensive organisation that started late on its Year 2000 programme will have many such projects due for delivery in the first half of 1999; the time available to such organisations predetermines this. So these organisations must have a high risk of experiencing congestion. And we can get a reasonable grasp on the number of lagging large organisations from reputable surveys. We believe that these points could be used to predict a baseline level of disruption, with other factors superimposed via some appropriate randomising technique.

Conclusion: And Two Worst Case Scenarios

We hope the above has served to dispel the popular myth that it all happens on 1st January 2000 and has at least given a different and more realistic perspective on possible outcomes. We hope, additionally, that it may serve to warn of the possibility of needing contingency plans earlier than expected. Thirdly, we would highlight the concept of congestion, which we believe will be a particularly important factor in determining outcomes. There are several reasons for our focus on the concept of congestion, to which this analysis has given rise. Firstly, it is not just a matter of theory; we have already witnessed the initial stages of this process in some UK organisations. Secondly, although the potential for congestion is high in a relatively small although important percentage of UK organisations, it must be very much higher in countries lagging the UK in Year 2000 programmes; and that is most of western Europe, let alone the rest of the world. In those circumstances, it is difficult to believe that congestion will not occur on a widespread scale; and wherever it does occur, it is certain to be highly disruptive.

Finally, we would offer two worst case scenarios which we believe have a high probability of occurrence. We'll call them death by attrition and death by a thousand cuts.

Death By Attrition

The scenario we call death by attrition would occur in a company that experiences several failures in Year 2000 projects that overlap in time to recovery. In principle, it doesn't matter what the failures are, but we have indicated above the type of failure most probable in this scenario and the category of company in which it would occur.

Here is the script. A project to replace the finance system, one of a dozen within the Year 2000 programme to be delivered within three months, fails to go live on time. Time to recovery is estimated at two months but takes four. The project scope, already reduced to just five modules of the software, is further reduced to three modules. Some staff are diverted to manual procedures, some temporary staff are hired and some reports and automated reconciliations are scrapped. A month later, another project delivery date is missed, more routine reports are scrapped, a semi-manual workaround is agreed and more temporary staff are hired. At this point, the build-up of paper documents and intake of temporary staff has filled all available office space, even with hot desking by many permanent staff. In addition, the backlog of paper documents for subsequent computer input is now unrecoverable by existing permanent staff within the financial year. A third failure occurs...

All this is, of course, purely hypothetical. Except that we already know of organisations in which the early stages of this scenario are happening, and the projected failure rate, 3 out of 12 projects or 25%, is well below the normal failure rate for significant IT projects. The key here is overlap of the second and third failures with the time to recovery of the first, and the result is that the organisation grinds to a halt. It may not fail absolutely (cease trading), or may not do so immediately, but is likely to be vulnerable for at least a year afterwards.

Death By A Thousand Cuts

This scenario is essentially just a distributed version of the former. The key again is overlap of some failures with the time to recovery from others (congestion), but the failures happen within some inter-trading or otherwise interdependent network of organisations. We believe that the probability of failures coinciding in time within any interdependent network of organisations must be very high, although we cannot put a figure on it. However, that in itself does not matter; what matters, and the probability of which is difficult to quantify, is whether the set of failures will coalesce into some failure of the chain of dependency. If it does, then potentially the whole chain experiences some level of failure and individual elements within it may suffer irrecoverable failure.

A putative script would be a failure of finance systems in some companies in a supply chain. For all small companies, cashflow management is key to survival. Failure of payment systems in one or two cases, nervousness in creditor banks with warning flags already set and, perhaps, failure of one significant customer could easily be enough to set in train a catastrophic failure within the chain. Another putative script could occur in air transport. Virtually all individual failures in this tightly integrated industry produce an immediate deterioration in traffic flow, but one that is generally recoverable within 2-3 days. Multiple individual failures, even of different types but overlapping in time at, say, multiple major European airports, would quickly produce a high level of chaos with a correspondingly lengthy recovery time. This scenario is highly speculative, but we would suggest that the probability of failures of some sort overlapping in time in organisations within any interdependent network of significant size must be so high as to be almost certain. The key unknown is whether the unfortunate coalescence will occur.
Other Scenarios

We do not wish to dismiss other scenarios, such as terminal failure of a single organisation through a single catastrophic incident. Indeed, we feel that some such failures are almost certain to occur somewhere, given the widespread exposure to them. However, the probability of occurrence must be low in any particular case and the incidence is unpredictable. We suspect that more is to be gained by focusing on the scenarios we have suggested and on high probability, highly disruptive failures.

We would like to acknowledge helpful comments in constructing this paper received from a number of contacts, including Dr Ross Anderson (Cambridge University), Dr Doug Morrison (Y2Ki), Graham Ride (Cybermetrix), Dr Martyn Thomas (Bristol University) and Dr David Walton (Durham University). Any mistakes are not theirs.



-- Gordon (g_gecko_69@hotmail.com), December 04, 1999.


From the Implementation Overrun paragraph:

Freudian slip? Or is there something about VIRGINS in the tech field of which I am unaware?

"Wise virgins will have included such projects within their Year 2000 programmes but many, to our certain knowledge, have not."

Really, thanks for the post, it is a truly fascinating article.

-- Hokie (nn@va.com), December 04, 1999.

