How do we recover?

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

1. In the event that 2k arrives & systems do fail, what would it take to get systems back on line?
2. What is the actual process of making a computer 2k compliant? Can we simply purchase software, or does someone actually need to go through each computer's code and change the appropriate information? Couldn't software be downloaded from the internet?
3. Is it possible that even systems that are thought to be 2K compliant, or updated to be so, could still fail due to their own flaws?
4. Is there any thought on which would be more common, shut-downs or systems going haywire?
5. Won't this problem repeat itself in Feb. 2000 due to the leap year? Has this been addressed as a serious problem? What is being done about it?

Thanks for the info!!

-- John Q. Public (BornFree.I@worldnet.att.net), May 06, 1998

Answers

Unfortunately, there is no straightforward answer to most of your questions, and this is one of the reasons for the anxiety many people feel regarding Y2K.

The fact is that there has never been a universal standard for how software should handle exceptions; this means that no one knows for sure how dates are treated in every suite of software presently operating in systems around the world, and therefore no one can say with certainty whether the more common mode of failure would be shutdowns or haywire systems. The good news is that the large majority of systems developed in the last 5 to 10 years are likely to handle complete four-digit years rather than two-digit years. The bad news is that there are probably many more of the older (two-digit) systems in use, and it is certainly possible that many of the newer four-digit systems must interface with older systems in order to accomplish their day-to-day tasks.

While it is possible that Y2K bugs in commercial (i.e. PC / Mac) software could be addressed by software that is downloadable from the internet, I think it is widely agreed that these kinds of Y2K problems are likely to be more benign than Y2K problems in the more "custom" software that is written for embedded applications, mainframe computers, and minicomputers, as well as specialized business applications and process control systems. These special applications will almost certainly require the author of the software (or whoever is handling his legacy, if he is no longer with the organization--assuming the organization that developed the software is still in business, too!) to verify that the application is Y2K compliant. If the design is rigorous enough, the design documentation may provide enough information to determine whether the system is Y2K compliant; otherwise the code itself may have to be waded through, line by line. (Note: In my years of experience as a software engineer, I have found that very few systems are designed with sufficient documentation to make this kind of determination without actually wading through the code.)

If you are not familiar with the software development / debug process, the thought of wading through millions of lines of code in search of a Y2K bug may sound like searching for a needle in a haystack. In reality, while the process is tedious and time-consuming, it is really not quite so intimidating. This is because--generally--the engineer has many "clues" to look for (e.g. there are only a finite number of variables in any application that are date dependent, so isolating these variables and their coupling to other variables is essential to reducing the scope of the problem to a solvable level), and he also has software tools at his disposal to aid in repetitive tasks, such as searching for variable names.
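As an illustration of the kind of search tool mentioned above, here is a minimal Python sketch that scans source text for identifiers that look date related. The keyword list, the sample lines and the function name find_date_identifiers are assumptions made up for this example, not part of any particular Y2K toolkit; a real audit would also have to trace how each variable is coupled to others.

```python
import re

# Hypothetical keyword list: substrings that often appear in date-related names.
DATE_HINTS = ("date", "year", "yy", "month", "day", "century")

# Matches identifiers made of letters, digits, underscores and hyphens.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def find_date_identifiers(source_text):
    """Return {identifier: [line numbers]} for names that look date related."""
    hits = {}
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for name in IDENTIFIER.findall(line):
            lowered = name.lower()
            if any(hint in lowered for hint in DATE_HINTS):
                hits.setdefault(name, []).append(line_no)
    return hits


# Made-up sample lines in a COBOL-like style, for illustration only.
sample = """\
MOVE INVOICE-YY TO WS-YEAR.
ADD 1 TO COUNTER.
IF DUE-DATE < TODAY THEN PERFORM LATE-FEE.
"""

for name, lines in sorted(find_date_identifiers(sample).items()):
    print(name, lines)
```

A name-based search like this only narrows the haystack; it will miss date fields with uninformative names and will flag some names that have nothing to do with dates.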

The leap year issue is not likely to pose a special problem in the year 2000. A good deal of software has successfully run through a number of leap years without any problem, and the simple divide-by-four test gives the right answer for 2000 (a century year is a leap year only when it is divisible by 400, which 2000 is). However, the non-leap year of 2100 may pose a challenge because most software calculates the leap year by a simple division by four; while this is generally sufficient, it is not really accurate, and the year 2100 will be the next year that is divisible by four but is not a leap year. Other time related problems similar to the Y2K problem (primarily in embedded systems) are likely to occur later in the 21st century due to 32-bit timers that count elapsed time since a particular year (1970 is a typical one). These timers are due to "roll over" not in the year 2000 but I think sometime in the 2030s (I actually took the time to calculate the exact year once, but I have since forgotten it!)
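For readers who want to check the two calendar points above, here is a small Python sketch. It compares the simple divide-by-four test with the full Gregorian rule (century years are leap years only if divisible by 400) and computes when a signed 32-bit count of seconds since January 1, 1970 runs out. The function names and the assumption of a signed, one-second counter are mine; other timer widths, tick rates or starting years give other dates.

```python
from datetime import datetime, timedelta, timezone


def is_leap_simple(year):
    """The shortcut described above: any year divisible by four."""
    return year % 4 == 0


def is_leap_gregorian(year):
    """Full Gregorian rule: century years must also be divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (1996, 2000, 2100):
    print(year, is_leap_simple(year), is_leap_gregorian(year))

# Assumption: a signed 32-bit counter of seconds since 1970-01-01 00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_SIGNED_32 = 2**31 - 1

# Last instant the counter can represent before it rolls over.
rollover = EPOCH + timedelta(seconds=MAX_SIGNED_32)
print(rollover.isoformat())
```

The two tests agree for 1996 and 2000 and disagree for 2100. Under the stated assumptions the counter's last representable second falls on January 19, 2038, which is consistent with the "sometime in the 2030s" recollection. A program that applied the century exception but left out the 400-year rule would, by contrast, get 2000 wrong.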

In any case, I believe that the impact of the Y2K problem will be far less catastrophic than most of the predictions in this forum. So, while I am concerned about the problems you raised and how they can be mitigated, I do not yet feel the intense anxiety that I sense among many other participants in this discussion group.

David Auslander

-- David Auslander (dauslander@usa.net), May 06, 1998.


Another aspect of the problem involves the credence we can place on reports, particularly anonymous reports, that appear on this and other lists. I have the permission of Dan Cormier, Y2K Water Discussion Moderator, to reproduce his rules below. I heartily recommend them for this list also. Art Scott wrote:

>Will there be an effort to report factual information. One of my
>concerns about anonymous reports is how much credence to place on them.
>Regards, ArtS

You have a very good point here, Art. I can't agree with you more. The only way we will ever be able to solve our Y2K problems is by the gathering of factual information.

I am deeply committed to this. I consider all the misinformation that is circulating around to be the greatest threat to our efforts. The consequences of propagating unverified information are awesome:

1. It makes us lose precious time. As time is running out, the process of sorting facts from rumors becomes more and more burdensome.

2. Erroneous information may mislead someone into a long process of testing, here again stealing precious time from the real problems.

3. It undermines our credibility.

4. It destroys the effectiveness of our awareness effort. The Y2000 problem is real. It requires determination and time to be solved. Many however are still in the awareness/denial phase. Some water districts are still waiting for a budget to start their remediation process. Hyperbolic information will not increase awareness. It will only discourage the honest inquirer and give more arguments to those who proclaim that the whole issue is nothing more than an IT specialists' set-up to draw a lot of money to themselves.

For this reason I am carefully checking every bit of information I can find. Unfortunately most of it has been proved either unreliable or completely false. The only reliable information I have so far is from water districts that are members of our private discussion list. And I have no choice but to respect their wish of confidentiality.

As long as we don't know of any water district that has completed its compliance tests, we still don't know if we will have drinkable water in year 2000. That's the only hard fact I have so far. Wish I were wrong.

Daniel Cormier 05.02.98

-- Art Scott (Art.Scott@marist.edu), May 06, 1998.


I have been conducting a seminar on Year 2000 problems for senior citizens. In a couple of weeks we tackle the "embedded systems" section. Towards that end I have been writing a paper to get my own head straight and to figure out a way to explain it to ordinary people who are not computer nerds but are literate, thinking people. I invite your comments on the plain text version of draft #4 below:

Embedded Systems that Include Clocks - draft #4

Purpose: The purpose of this essay is to help get my own thinking straight and to help ordinary folks, i.e. non-computer-nerds, to understand the issues with embedded systems. The essay will address the construction of embedded systems, the logic of how clocks work, the conditions that make a Year 2000 problem, what makes embedded systems different, what makes the Year 2000 problem different, and some testing and recovery strategies.

Construction of embedded systems: Embedded systems suffer all the same modes of failure as any other system, plus some unique conditions that compound the complexity of predicting and locating failed components.

There are a number of possible combinations of hardware, software, and firmware that have been employed in constructing embedded systems. The simplest and most difficult to change is the hardware-only chip where the logic is etched into the silicon. The only practical way to change the logic is to replace the chip with another which contains the new logic.

A more flexible embedded system adds software, held in memory that contains the instructions, to make the combination of hardware and software perform some function. A simple PC is an example of a system of hardware and software. How easy it is to change the software, and therefore to change the function of the embedded system, depends on many factors; some are easy to change, others more difficult.

A third variation includes some combination of hardware, software and firmware. Firmware is a kind of memory that contains the software in a quasi-permanent form. Microcode and BIOS are examples of firmware. All three types and combinations exist in machines today.

A more insidious design uses multiple layers of embedded systems, like a Russian Babushka doll. A designer may buy a full function chip off the shelf to incorporate into his design if the chip is less expensive than a new design to perform a lesser function. Thus a fully functional clock may find its way into a machine that does not utilize the full clock function.

The user may be unaware of the combinations used in the machines he owns. For example, most users don't know how the clock works in the machine; some may not know there is a clock in the machine. Under most circumstances it is no more necessary to know how the clock works to get useful results from a machine than it is to know how a gasoline engine works to drive a car. Some people who own PCs don't know there is a battery in the PC that keeps the clock running when the machine is turned off. Some people don't know there may be a battery in other equipment, e.g. a VCR, to remember the settings during a short power failure. Some people do not know how many embedded systems surround them nor where they are. How many embedded systems do you have in your home?

Logic in clocks in embedded systems: Clocks are the bone of contention in embedded systems. There are many date related bugs that come under the heading of Year 2000 (Y2K); not all of them fail on Jan 1, 2000 but some will. The first necessary part of a clock is an accurate source of pulses. In a mechanical clock a pendulum or flywheel performs this function; in an embedded chip it's an oscillator, perhaps crystal controlled for accuracy. For simplicity let the oscillator run at 60 megahertz (millions of pulses per second). By design our clock will display seconds, minutes, hours, day of the month, the month and year; by convention the display will be in reverse order, i.e. seconds on the right end, thus:

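(Sorry, the diagram was lost in posting. It shows a row of boxes, with the oscillator at the right end and "carry pulses" passing from box to box toward Year.)

Year <- Month <- Day <- Hour <- Minute <- Second <- Osc. (60MHz)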

Initial conditions: On power on the clock assumes some time and date determined by the designer.

Logic: (Each box is a simple counter, counting and displaying input pulses.)

1. Every 60 million pulses forces 1 carry pulse into Second.
2. Every 60 pulses into Second forces 1 carry pulse into Minute, reset Second = 0.
3. Every 60 pulses into Minute forces 1 carry pulse into Hour, reset Minute = 0.
4. Every 24 pulses into Hour forces 1 carry pulse into Day, reset Hour = 0.
5. Now things get more complicated:
   - If (Month = January) and (Day = 31) then force carry pulse into Month and reset Day = 1.
   - If (Month = February) and (Year not = leap year) and (Day = 28) then force carry pulse into Month and reset Day = 1.
   - If (Month = February) and (Year = leap year) and (Day = 29) then force carry pulse into Month and reset Day = 1.

The remainder of the logic is left to the student as an exercise! Other logic trees, perhaps more efficient than this one, could be implemented. The rules for what is a leap year are complex and mistakes have been made in past implementations.
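The counter chain described above, including the part left as an exercise, might be sketched in Python as follows. This is only an illustration under assumptions of my own: it models one tick per second rather than the 60 MHz oscillator stage, stores a full four-digit year, and uses the full Gregorian leap-year rule. The class and function names are made up for the example.

```python
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_leap(year):
    """Full Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in(month, year):
    """Number of days in the given month (1-12) of the given year."""
    if month == 2 and is_leap(year):
        return 29
    return DAYS_IN_MONTH[month - 1]


class CascadeClock:
    """Chain of counters; each stage carries into the next when it overflows."""

    def __init__(self, year, month, day, hour=0, minute=0, second=0):
        self.year, self.month, self.day = year, month, day
        self.hour, self.minute, self.second = hour, minute, second

    def tick(self):
        """Apply one carry pulse into Second (one second elapses)."""
        self.second += 1
        if self.second < 60:
            return
        self.second = 0
        self.minute += 1          # carry into Minute
        if self.minute < 60:
            return
        self.minute = 0
        self.hour += 1            # carry into Hour
        if self.hour < 24:
            return
        self.hour = 0
        self.day += 1             # carry into Day
        if self.day <= days_in(self.month, self.year):
            return
        self.day = 1
        self.month += 1           # carry into Month
        if self.month <= 12:
            return
        self.month = 1
        self.year += 1            # carry into Year

    def read(self):
        """Return the current contents of all six counters."""
        return (self.year, self.month, self.day,
                self.hour, self.minute, self.second)


clock = CascadeClock(1999, 12, 31, 23, 59, 59)
clock.tick()
print(clock.read())
```

Because this sketch keeps four digits in the Year counter, the final carry is harmless; the trouble discussed in this essay arises when that last stage holds only two digits.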

Carry pulses propagate from stage to stage somewhat slower than the speed of light so that some time must be allowed for the maximum carries to propagate and the clock to settle down before the contents can be read out.

-- Art Scott (Art.Scott@marist.edu), May 06, 1998.


(I had to break the message in parts.)

Consider this most important problem with Year. In the early days hardware was so expensive strategies were developed to minimize the amount of hardware required. It was considered elegant programming to minimize the number of steps required to carry out a procedure. One of the obvious ways to save hardware, storage space and execution time was to record only the last two digits of the year. Everybody knew that 68 stood for 1968 just as everybody knows that 98 stands for 1998! For years this strategy worked; saved time, space and money; on some projects it probably made the difference between doing it and not doing it.
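To make the two-digit convention concrete, here is a minimal Python sketch of a record that keeps only the last two digits of the year, with the century assumed to be 19. The function names are hypothetical. It shows that interval arithmetic works within one century and goes negative once the stored year wraps from 99 to 00.

```python
def store_year(full_year):
    """Keep only the last two digits, as many older systems did."""
    return full_year % 100


def years_between(start_yy, end_yy):
    """Interval computed directly on two-digit years (no century information)."""
    return end_yy - start_yy


def expand_assuming_1900s(yy):
    """Rebuild a four-digit year by assuming the century is always 19."""
    return 1900 + yy


opened = store_year(1968)
print(years_between(opened, store_year(1998)))   # within one century
print(years_between(opened, store_year(2000)))   # across the rollover
print(expand_assuming_1900s(store_year(2000)))   # year 2000 read back
```

The first result is the expected 30 years; the second is -68 rather than 32, and the year 2000 reads back as 1900.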

As we approach the millennium the strategy backfires. What happens when the two-digit Year rolls over to 00? Unfortunately, it depends on what the designer decided. As an example and a concern for every PC owner, some PCs roll over to January 4, 1980, some to January 4, 1984 (endnote 2). Each owner must find out what his PC does. When that's done, he must find out what the software designer did about using and storing years in the operating system and the application code.

Conditions for embedded systems to be a Year 2000 problem: The problems are more complex for embedded systems, e.g. where are all the embedded systems? Were inventories ever made of all the embedded systems that came with new sophisticated equipment? Probably not. Did anyone dig down and determine the second and third level source of embedded logic? Probably not. Yet all of these must be located if we are to pre-empt disruptions. Can we sort any of this out to save some time and effort?

Some necessary conditions for failure on Jan 1, 2000:

1. Equipment must accurately count to the Year 2000.
2. Equipment must be continuously powered, either with an on-board battery or continuously connected to a wall source. I've seen it asserted that no rational person would design equipment that required such maintenance. Many of us bought PCs without ever asking whether an on-board battery existed. I doubt that most user manuals even mention it, to say nothing of its specifications or how to replace it.
3. The clock must be settable to today's date and time. Would anyone design a clock that couldn't be set? Probably not, but such a chip could, Babushka-like, be buried n levels deep and trigger a failure at some random time after Jan. 1, 2000.
4. If the equipment relies on the date function, then you must investigate what happens to that equipment on 99 to 00 rollover.
5. If the equipment does not rely on the date function:
   - If the equipment uses two sequential clock readings to measure time, a large difference exists when the clock rolls over. We've seen above that it takes time for the clock to settle down before its contents can be read out. It requires more time to read out the contents and find the difference between the old clock reading and the new clock reading. It requires more time to determine if the difference is less than, equal to, or greater than the desired reference time. The sum of all the required times might even be variable. Depending on the rate at which the designer chooses to sample the clock, the rate at which the clock is being updated, and the amount of time required to do the calculations, there is some probability that the difference in clock readings will never be precisely equal to the reference time as the real time difference passes by. As a consequence the designer could not use "equal to" but could use "greater than" or "less than" as the decision criterion. How frequently will this cause a problem? If the designer used "greater than" as the decision criterion and the 99 to 00 rollover produces a large difference, the criterion is met too early, and then the operation continues normally. If the designer used "less than" as the decision criterion, then the criterion is never met, potentially producing a problem. If you must pre-empt any disruption, then you must investigate what happens to that equipment on 99 to 00 rollover.
   - If x clock pulses are used to measure time, there is probably no Year 2000 problem.
   - If the equipment has a fully functional clock buried within, which is not (re)settable and/or does not have a continuous power supply, then the chip may cause a problem when it calculates that it's Jan. 1, 2000, which will actually be at some random time after Jan. 1, 2000.
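The comparison problem in item 5 can be illustrated with a short Python sketch. It assumes, purely for illustration, a clock that reports a two-digit year plus a day of the year, and a timer that subtracts two readings; the function names and the 30-day reference are hypothetical. Whether the rollover difference shows up as a large negative number or a large positive one depends on how the designer represented it, so the sketch computes both the signed difference and its magnitude.

```python
DAYS_PER_YEAR = 365  # simplification: leap days are ignored in this sketch


def day_count(yy, day_of_year):
    """Days since the start of the (assumed) century, from a two-digit year."""
    return yy * DAYS_PER_YEAR + day_of_year


def elapsed_signed(old, new):
    """Difference kept as a signed number: goes negative at the rollover."""
    return day_count(*new) - day_count(*old)


def elapsed_magnitude(old, new):
    """Difference with the sign discarded: becomes very large at the rollover."""
    return abs(elapsed_signed(old, new))


def interval_expired(elapsed, reference_days):
    """'Greater than or equal' decision criterion for a hypothetical timer."""
    return elapsed >= reference_days


started = (99, 360)   # December 26 of year "99"
now = (0, 1)          # January 1 of year "00", six real days later

print(interval_expired(elapsed_signed(started, now), 30))
print(interval_expired(elapsed_magnitude(started, now), 30))
```

Only six real days have passed. With the signed difference the 30-day criterion is not met at the rollover; with the magnitude it is met far too early. Which of these a given piece of equipment does can only be settled by inspecting or testing it.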

What makes embedded systems different:

1. Inventories of embedded systems do not generally exist. Location of some impending problems is unknown.
2. Some embedded systems are not physically accessible.
3. Owners of the embedded equipment do not have access to the source code for the software or firmware. Without this knowledge, the owner can neither inspect nor correct the problem.
4. The software is often burned in to the chip, so a fix requires that a new part be identified, a manufacturer identified, and the part ordered, shipped, tested, installed and system tested. Some replacements will not be available or not available on time.
5. Some embedded controllers of the same make and model are manufactured using batches of chips bought on the spot market. In the middle of a run the factory may change from brand A to brand B chips. In theory A & B are identical but in practice they are only mostly alike. For critical applications it's not sufficient to say Brand X model Y is y2k compliant by testing only one sample.
6. How can you be sure newly purchased embedded systems are y2k compliant? If the application is critical then they must be tested!

What makes Year 2000 problem different:

1. Many embedded systems will roll over at the same time. Shortly after 23:59:59 Greenwich Mean Time (now called UTC), systems that keep GMT will roll over from 99 to 00.
2. Systems that keep local time roll over at roughly one hour intervals, as each time zone reaches the new year, beginning with the zones just west of the International Date Line. As each zone rolls over, problems will manifest themselves:
   - some untested embedded systems,
   - some untestable embedded systems,
   - some inaccessible embedded systems,
   - some undiscovered embedded systems,
   - some tested but not replaced embedded systems
   will begin making their presence known.

Trouble-shooting one bug at a time is difficult enough. On Jan 1, 2000 we will have multiple problems to handle at one time. How many? No one knows. Estimates vary from 1/10 of 1%, to 3 - 5%, to 11% of 25 billion embedded systems expected to be installed on January 1, 2000.

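To put the quoted estimates in absolute terms, the arithmetic can be worked out in a few lines of Python. The failure rates and the 25 billion figure are the estimates quoted above, not measurements; the variable and function names are mine.

```python
from fractions import Fraction

INSTALLED = 25_000_000_000  # estimated embedded systems installed on Jan 1, 2000

# Quoted failure-rate estimates, expressed as exact fractions.
ESTIMATES = {
    "1/10 of 1%": Fraction(1, 1000),
    "3%": Fraction(3, 100),
    "5%": Fraction(5, 100),
    "11%": Fraction(11, 100),
}


def failing_systems(rate, installed=INSTALLED):
    """Number of systems implied by a failure rate."""
    return int(rate * installed)


for label, rate in ESTIMATES.items():
    print(f"{label:>11}: {failing_systems(rate):,}")
```

Even the lowest estimate corresponds to 25 million systems, and the highest to 2.75 billion.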
Testing & recovery strategies: The foregoing analysis leads to the conclusion that we need both a proactive strategy to pre-empt as many problems as possible and a reactive strategy or contingency plan to cope with the problems that are not or cannot be found prior to Jan. 1, 2000.

Some applications, e.g. nuclear plants, airplanes, trains and power plants, are deemed so critical that a major effort is demanded to investigate, correct and test all embedded systems well before Jan. 1, 2000. Some applications, where the risk/reward ratio is lower, will get lower priority and less attention. Available time, money, skilled manpower and management beliefs will determine how many embedded systems are located, tested and corrected prior to Jan. 1, 2000. Some people assert there is not enough time and skilled manpower to correct all embedded systems before Jan. 1, 2000.

Therefore, some of the untested, untestable, inaccessible, undiscovered, and tested-but-not-replaced embedded systems will fail on and after Jan. 1, 2000. Contingency plans will have to be developed, put in place and manned on and after Jan. 1, 2000 to cope with these failures.

Acknowledgments: Primary thanks go to Dick Mills of Amsterdam, NY, home page: http://www.albany.net/~dmills for his critique and suggestions re this paper.

Art Scott May 6, 1998

(There are some Endnotes on DOS clocks, leap years, etc.)

-- Art Scott (Art.Scott@marist.edu), May 06, 1998.


Sorry, guys, the system concatenates all the carefully spaced and formatted presentation. Does anyone know how to copy text to the bbs and keep it readable?

-- Art Scott (Art.Scott@marist.edu), May 06, 1998.

