Utilities and embedded systems; more confirmation of bad news.
greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread
copied from the web:
Response to Crouch/Echlin and Power Utilities - Is this being considered? Date: 1998-10-30
_Any_ IBM PC/AT based embedded system will almost certainly cause problems for _any_ application in which it is used. I have yet to hear of any system involving more than 3 PCs that did not have to be remediated. In other words, if you have any moderately sophisticated embedded system controlling or monitoring _any_ process that involves more than 3 IBM PC/AT compatible boxes, the chances are greater than 90% that the system will fail. Period.
Don't pay any heed to these people who say "no problem". I have found that they fall into one of three broad categories: they are just plain ignorant and like the sound of their own voice spouting off on the 'net; or they are not completely ignorant but in denial; or, and this is the most dangerous class, they do understand the problem but they are over-specialised and recognise neither the scope nor the full extent of it.
For a list of realtime clock chips that HAVE the Y2K bug and will NOT be fixed check here: http://www.mot-sps.com/y2k/yr2knote.html and here: http://www.mot-sps.com/y2k/black_prod.html
These chips are used in hundreds of millions of IBM PC/AT compatible embedded systems ( and hundreds of millions of IBM PC/AT compatible desktop systems too ). These chips are also used in hundreds of millions of other embedded systems of which I know absolutely nothing. I am an IBM PC/AT embedded systems expert ( but not overly specialised ).
Check here to see what an IBM PC/AT compatible embedded CPU board looks like: http://www.ampro.com/products/coremod/cm-p5i.htm That board measures 3.5" x 3.25" you can put them everywhere, and people do. Also notice what they say about Y2K ( hint: nothing ) .
Check here for an excellent page describing Y2K problems and embedded PCs: http://www.qnx.com/support/y2k/index.html (it also gives a pretty good feeling for the complexity of the issue)
Check here for a "short" list of industries and applications that use embedded PCs: http://www.qnx.com/realworld/index.html and here: http://www.qnx.com/company/compover.html#Customers
As I have mentioned many times before, there are dozens of embedded OS manufacturers and dozens of embedded hardware platforms for the IBM PC/AT compatible market alone. There are also literally hundreds, maybe thousands, of custom/in-house/proprietary embedded kernels and OS's too.
Something else to think about, once you've got your head wrapped around this part of the story remind yourself that the embedded IBM PC/AT compatible market is a small fraction of the overall embedded systems market and hence a small part of the embedded systems Y2K problem.
See my comments in the thread "Gartner Report ( 98-10-12 ) " in this forum for more info. http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000BzM
Feel free to email me if you'd like further clarification on any of this.
Finally, don't let anyone tell you that there is no problem with embedded PCs. They simply do not know what they are talking about.
Regards, Andrew J. Edgar Manager, Systems Software Centigram Communications Corp.
Disclaimer: I speak only for myself and from my personal experience. I do not speak as any kind of representative, nor spokesperson of my employer.
-- Goldi (email@example.com), November 02, 1998
Goldi, your post is a good illustration of the way the embedded systems problem is exaggerated by well-meaning generalists. I went to the link which listed Motorola systems that are not Y2K compliant. Now let me explain why that is not the proof of disaster that the author suggested. I hope you don't mind if I speak from experience, since my team has tested Motorola embedded systems and the results we got mirror those of my peers at other companies. The systems we tested which are not Y2K compliant internally behave as follows:

a. When the date is set to 12/31/99 and allowed to roll over to 01/01/00, the operating system decides that 00 is not a valid year and resets it to 80 (as in 1980).

b. The control systems which these embedded systems are part of continue to function as before, with absolutely no problems, because the "application software" doesn't give a damn about the date.

Without a great deal of effort you can find similar testimonials. I know of folks in electric utilities and manufacturing facilities who have had the same results. Now, I am not saying this will be the case everywhere these devices are used. If a device has been programmed such that the application for which it is used relies on the internal date for its functionality, then that application has a Y2K problem. The important point here is that the author is grossly mistaken in projecting failures wherever internal clocks don't recognize 2000. He is just one of many who are shooting off their mouths based on information which is a mile wide but only an inch deep. The embedded systems problem is real and it is serious, but the public is ill-served by reports which greatly exaggerate it.

Jack, don't try to understand this post - your head might explode.
-- Woe Is Me (firstname.lastname@example.org), November 02, 1998.
"copied from the web"
Please list the source URL when posting things like this.
-- Buddy Y. (DC) (email@example.com), November 02, 1998.
Woe, I'm just glad that you didn't say that anyone ever claimed that embedded systems are sensitive to FISCAL year rollovers!!! Your post did do a pretty good job of pointing out the interaction between the "system" versus "application" parts, which validates the fundamental foundation of any Y2K plan: you HAVE to do an inventory and look at each piece. It may very well be that a non-Y2K compliant system will still work, if the application per se is not date sensitive, etc. (Or, for that matter, that the system date can simply be set back so that it always stays in the 20th century.) But you don't know until you check, which is a very time consuming task. And time is a commodity that is very precious at this late date. The result is that as we approach Year 2000, the embedded systems will clearly continue to be the "wildcard" of Y2K, and we will find out which ones have problems and which ones don't only when we get there. (BTW, my head did indeed explode last week, due to my brain being non-newage compliant.)
-- Jack (firstname.lastname@example.org), November 02, 1998.
This is an answer to a question I posted on euy2k.com. I also posted the question and this answer here on this forum a few days ago. I asked the question as a way to find out if this problem might come up in the area of utilities.
Woe, you're missing the point regarding this situation, but that doesn't surprise me. I can't imagine you yourself have done any research regarding this possible problem - its scope, impact, severity, etc.
It's a problem that hasn't become mainstream yet. But the same was true of embedded systems themselves not too long ago.
Go to this url if you want to understand exactly what occurs:
-- Michael Taylor (email@example.com), November 02, 1998.
Woe, just to make it easier... you don't have to go to the link I'll bring the text to you from the link above ======================================================================
The Crouch Echlin effect, detailed
by Mike Echlin Yeovil Systems Research and Development. August 1998
As we all know, there are going to be problems associated with the change to the next century, the "Year 2000 Problem." We have discovered a NEW year 2000 problem, the Crouch Echlin effect, that until now has been fairly unknown. The Crouch Echlin effect is a random jump in date and time that occurs at random boots of affected computers and embedded systems, only after rollover to year 2000. It can also be accompanied by a loss of some hardware and CMOS settings. The systems that show the Crouch Echlin effect have one thing in common: a real time clock (RTC)/CMOS that, if accessed during the once a second update cycle, gives bad data to the BIOS POST date/time routine. This effect has the potential to cause such problems as randomly crashing your accounting system or, in the case of extreme system failures, causing a plant or factory to stop production.
a. it's going to be bad enough anyway
Yes, the year 2000 is coming; there isn't any way to stop time, so we have to prepare and know what we are dealing with. There are 3 main groups of Year 2000 problems:
* Rollover problems,
* Date comparison problems, and
* The computer just not doing the same thing before and after year 2000.
Rollover problems occur right at midnight, year 2000 (or thereabouts - whenever the device thinks it is midnight). This is when the hardware can get confused about the date: some computers will just quit at this time, others will go to 1900 and possibly display an error value like 1980. This is a known problem, with known fixes.
Date comparison problems can occur in software anytime the software is comparing a date from before year 2000 with one after year 2000. If the century is not included in this comparison, or is included with a wrong value, the results will be incorrect, and your calculation will give a false answer, or just blow up. This also includes special dates, and leap year calculations. These problems must be fixed in the code, or the spreadsheet, or whatever they develop in. This is also a known problem, but we may not have enough time to fix all of these errors.
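A minimal sketch of the comparison failure in C, assuming a hypothetical record layout that stores only two-digit years (illustrative only; the function names and the pivot value are my own, not from the article):

```c
#include <assert.h>

/* Naive comparison on two-digit years: 00 (meaning 2000) sorts
 * before 99 (meaning 1999), so post-rollover dates look "earlier". */
int naive_is_later(int yy_a, int yy_b) {
    return yy_a > yy_b;
}

/* One common repair: expand the two-digit year through a pivot
 * window before comparing. Years below the pivot are taken as 20xx,
 * the rest as 19xx. The pivot value 50 used below is arbitrary. */
int windowed_year(int yy, int pivot) {
    return (yy < pivot) ? 2000 + yy : 1900 + yy;
}
```

With this, `windowed_year(0, 50)` expands 00 to 2000 and compares correctly as later than 1999, whereas the naive two-digit test gets the order backwards - exactly the class of error that must be fixed in the code, the spreadsheet, or wherever the comparison lives.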
The third main group I mentioned above is the least known, and least prepared for group: computers not behaving the same way after the year 2000 as they did before. As a whole we have just assumed that once we got past the date problems from groups one and two, our computers would go on working just the same as they are now, giving the same answers back to the same questions, and doing the same thing as they are now. Boy were we wrong.
b. and now there's this new problem
While the expectation would be that computers will behave the same, do the same things, and have the same input/output after the millennium as prior to it, this does not appear to always be the case. On boot up, after year 2000, the computer may occasionally make a jump in its date or time. For example, you have been running your computer every day since January 1, 2000. It is now January 9. You have found and fixed most of the important Year 2000 problems you have seen. Yesterday when you shut off your computer the date and time were correct. Today you start your computer, but it reports the date as April 17, 2000, and the time is correct. It didn't happen before, but it will happen again. You have had an attack of the Crouch Echlin effect. It could have been worse. You might not have noticed it, and you may have worked all day in your accounting package with the wrong date, and saved files with the wrong date, storing them in archives with the wrong date, or even calculated payments or bills with the wrong dates, and wrong interest payable. Another side of the Crouch Echlin effect, although rare, is even more devastating: it may sometimes scramble the settings in your computer's CMOS memory, so that at the next startup you just can't start your computer or you may not be able to access some of the devices attached, or perhaps part of your computer, such as your hard drive, may not work correctly.
2. How this problem is important
a. potential consequences
On computing devices, irregular random jumps in date and time can lead to accounting date errors, file date errors, wrong dates being inserted in data, historical data being lost, misdated, or scrambled, and plant synchronization loss, which could in turn lead to plant control loss. It could be as simple as your Personal Information Manager getting set forward in date and not allowing you to set it back again. Or your accounting system may not allow you to make entries because the date on the entries is before the date in the system; if the jump is months or more, whole quarters of data may be lost as a result.

In plant control systems, a loss of synchronization may lead to data being stored with the wrong date/time. If there is an event that has to be analyzed, finding what data goes with that event may then be impossible - if the error is even noticed. If the error is not noticed, the errors in the reported data may misrepresent the causes of the event, leading to improper conclusions: the actual cause may not be addressed, or the wrong measures may be applied to the wrong problem, and if the event is of great enough significance, a catastrophe may result.

In a smart sensor type of device or PLC, this effect could cause improper inputs to the systems reading its data; this loss of integrity on the device means a loss of confidence in the whole system. On medical systems, nuclear generation, aircraft flight control, air traffic control, rail traffic control, electrical distribution control, and other safety critical systems, this loss of confidence means loss of service, and shutdown of plants or transportation/distribution systems.
b. how widespread?
This effect will affect devices in all classes of computers. It has been demonstrated on Intel and clone CPU based PCs and PC compatibles. It has also been demonstrated on other desktop platforms, as well as on workstation platforms. It has shown up on embedded systems, in some very hard to get at places. We are afraid it will show up on most computing platforms, on a significant percentage of the individual devices of each platform.
c. potential danger in embedded systems
Embedded systems pose a singular problem. These devices may not display anything, but may be monitored by a terminal. They may or may not use the time/date, but if they scramble their settings they will not work at all. And if they use the date/time, especially for synchronizing with other devices, be it in the same plant or in another time zone, their time will not be synchronized, and any data they are collecting will be suspect and not usable. If the device is in a mechanism that internally tracks something like the time of last maintenance, then that maintenance schedule will be interrupted, or not performed. This can cause loss of schedule control which will cost extra dollars, or possibly cause plant shutdowns.
3. How this problem turned up
a. initial reports
This effect was initially reported by Jace Crouch. One day he was doing a routine year 2000 rollover test on his computer, and after that test he decided that, since this computer was not used for any date sensitive work anyway, he would leave it set post year 2000 to see what would happen. He didn't expect anything to happen, but did the test anyway. His procedure was to set the date beyond year 2000 and use the machine as normal, shutting it off when leaving his office and turning it on when he came back. His results: "1. The system clock "ran" extremely rapidly. After two weeks, the system date was mid-December 2000. The date reported in CMOS and reportedly various WP and Microsoft applications was identical. Whatever date the RTC reported, the applications displayed. Files were saved with a 00 year date, but win 3.1 file manager displayed a :0 date. These files were readable, writable, and seemed otherwise normal.
2. After about ten days, the system would not recognize that I had two serial ports. For whatever reason, every system test that I ran reported only one serial port. This was the case with Norton Utilities 7, MSD, and an old WP utility. Nothing else started shutting down, but I figured that if a serial port went down, anything could be next, maybe trash the hard drive in some strange time-warp way.
3. Once I set the system clock back to the correct 1997 date, both serial ports were recognized, and they worked fine. I have no idea why the clock "ran" so fast in y2k, nor why the system stopped recognizing the second serial port. This was enough to convince me that even on a simple PC, the y2k problem can cause hardware failures." note: We now know it is not the RTC running fast that causes this problem.
I heard of Jace's findings and decided to try to recreate his results. Originally, Randall Bart called this effect "time dilation" ("TD"), when we thought that somehow the computer was "compressing" time. We later changed this to "time/date", keeping the initials "TD", since there was no real connection to the Einsteinian phenomenon. I defined a procedure to use for testing and set about to see if TD was real or not. I used a number of machines to test on, including a 286, two 386's, and a 486-33. The procedure: set the computer's date/time to a date/time in year 2000, record this date/time and the date/time on a control clock, and compare against the control clock to see if there was a jump in date/time. I recorded any differences between the computer's date/time and the control date/time in a log. What I discovered was that I could recreate Jace's results, and on all three classes of machines that I was testing. Here is a log from the 286.
Set to 01.01.2000 9:50:36 at 10.16.1997 9:50:36
01.01.2000 10.01.30 at 10.16.1997 10.01.30
04.12.2000 19.12.25 at 10.25.1997 19.12.25
04.15.2000 21.43.09 at 10.28.1997 21.41.44
04.18.2000 20.28.48 at 10.31.1997 20.27.23
End log.
Explanation of log. The first two lines identify the computer and state the start information and control time. The next line shows that the date change held over the first reboot. The third line shows a 3 month 2 day jump, while the time stayed the same. This is a classic TD jump: the time has stayed the same, but the date has been read wrong by the software in the BIOS, and during its calculation this new date has been created. The fourth line shows that 3 days later the date, although still 3 months 3 days out, hasn't gone any farther out, but the time is now 1 min 25 seconds ahead. The date has stayed the same but the time has gone off - this time only by seconds, not a real problem for an accounting package, but certainly significant to any real time system. The final line shows that by the end of our 2 week test period the time and date had not changed any more. There you have 2 jumps in 2 weeks. The log file contains only the anomalies, but the computer was powered on at least twice a day for the whole 2 weeks; in 30 or 40 power cycles the effect was seen twice.
Other People's Research Results.

From Barry Pardee, Americas Year 2000 Expertise Center Manager, Digital Equipment Corporation: "We at Digital have confirmed that TD is real and is a serious threat to PCs, servers, and embedded systems. Mike Echlin and/or Mark Slotnick can fill you in on some of the technical information. We are working with Mike on a diagnostic which will determine if a PC potentially has TD problems, or not. An automated fix for TD will be coming out in the near future."

Mark Slotnick, at Digital Equipment Corporation, has been studying this effect since hearing of it in late 1997. He now has the original motherboard from Jace's original machine (now affectionately called "Zoom"). Mark has this computer in his office, in a new case, with a new power supply, new battery, new hard drive... With all of these new components, but the original motherboard, he has still reproduced the original effect. This independent confirmation of our results, with the power supply, battery, and peripherals replaced with new parts, confirms our findings while ruling out any external influence as the cause of the effect.
c. development of discussion/tests
As we discussed first Jace's results, and then my confirmation of those results, a few theories developed as to how this could happen. First we thought that the RTC was running faster because it was after year 2000. This was very quickly discarded when we discovered that it didn't happen every time. Then we looked at the RTC/CMOS to see if there was a buffer overflow, but examining the registers of the RTC/CMOS showed that they weren't doing that. Then we discovered, by observation, that it wasn't happening while the computer was turned on, but possibly while it was off, which led us to look at either start up or shut down. Now, a PC/clone doesn't do anything special on shut down from DOS, and all but one of our test computers were running DOS. This pointed strongly to start up.

Also during this time I developed a program in Basic to read the RTC repeatedly and display this information to the screen. On all of the computers that showed this effect, anomalies appeared while running this program, but they were random, and the program was slow. So I developed a 'C' program to do the same thing, but one that stored the values in memory until memory was full, then dumped those values into a file for later viewing. This program, known as rtc.exe, showed us some very important clues and led to the development of the tests and everything we know today. It showed us that the RTCs in these computers gave back an error value of 0xff during the update cycle of the chip. It also showed that the computers that didn't show this effect did not give any error value during the update cycle. With the help of Tom Becker we found that this was because the RTCs that didn't show this error during the update cycle were double buffered. This led to the rtstst.exe program, which tested for the presence of an RTC that gives 0xff from any of the registers. It was at this time we also started a series of "benchmark" tests.
These repeated the test procedure defined in b above, but with dates both prior to and after year 2000, in two week cycles. This repeated testing showed that the effect happened on our test machines only after year 2000.
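The behaviour rtc.exe exposed can be sketched in C with a toy model of the chip (the struct and function here are illustrative assumptions for exposition - this is not real port I/O against the CMOS index/data ports, and the names are my own):

```c
#include <stdint.h>

/* Toy model of an MC146818-style RTC. A non-buffered part hands back
 * 0xFF from its time registers while the once-a-second update is in
 * progress; a double-buffered part keeps returning the last good
 * value, which is why buffered machines never showed the effect. */
struct rtc_model {
    uint8_t seconds;  /* BCD seconds register, e.g. 0x35 */
    int updating;     /* update-in-progress flag is set */
    int buffered;     /* does the chip double-buffer its outputs? */
};

uint8_t read_seconds(const struct rtc_model *rtc) {
    if (rtc->updating && !rtc->buffered)
        return 0xFF;      /* garbage, as seen in the rtc.txt dumps */
    return rtc->seconds;  /* stable value the rest of the second */
}
```

Reading during the update window from the non-buffered model returns 0xFF while the buffered model still returns the last good BCD value - the same contrast the rtc.exe logs below show between affected and unaffected machines.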
4. Current state of research
* The RTC in a PC is based on the Motorola 146818 design of CMOS chip. This chip is documented to give an error condition when it is in an update state; this error is the return of an 'ff' instead of the binary coded decimal value for seconds, minutes, hours, day, month, or year. Other designs of compatible chips may give other values for an error value, and some just give the contents of the register currently being updated in response to any request for any of the registers.
* The BIOS, when reading, first reads the status byte of the RTC to see if it is in an update state. If this is the case it waits for the status to return to normal; if not, it goes ahead and reads.
* If this read of the status byte occurs just before it goes to bad status, the BIOS code now has 244 microseconds to read the data before it goes to error condition.
* Our testing has shown 4 things:
1: The effect only happens post rollover to Y2k. This was shown by the repeated pre/post 2 week cycles of testing.
2: The effect is characterized by random time/date jumps occurring only at startup of the computer.
3: The one thing in common with these computers is a non-buffered Real Time Clock. (By observation, and correlation of those observations.)
4: When the time/date jumps and is wrong at the OS clock level, the RTC still has the correct time. (By analysis of machines when they were affected by the effect.) This shows that the effect is not caused by the RTC being wrong, but by the RTC being read wrong.
* The Famous 244.
The designers of the original 146818 RTC/CMOS chip realized that you can not read the RTC while it is updating. They also realized that people would not check the status after every read, so they set the status a little early, while the data is still good, and guaranteed that it would still be good for 244 microseconds (a grace period, I call it), so that you could read the status and then still have enough time to get the data even if the status changed right after you checked it. The problem here is that the BIOS before year 2000 takes less than 244 microseconds to read the RTC, but after year 2000, because of the different logic path it takes due to the century change, it takes longer than 244 microseconds. If you read the RTC at the beginning of a second you have all the time in the world to do it (almost a whole second, minus the time needed for the RTC to update at the end of the second). But if you start your read at the beginning of the 244 microsecond grace period, you only have those 244 microseconds to do it in.

The reason there is a different logic path after year 2000 is the way the PC keeps the date, and the way it reads the date from the RTC. The RTC has second, minute, hour, day, month, year, but the PC keeps time as clock ticks since midnight, and days since 01.01.1980. So, before year 2000, you just subtract 80 from the current year, adjust for leap years, and count the days in the current year. After year 2000, you have to adjust for the new century, count the years since rollover, the days in the current year, and the extra days in leap years, and calculate whether or not year 2000 is a leap year. And on top of all this, you have to convert from BCD ;-)
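The BCD decode and the extra post-2000 arithmetic can be sketched like this (a simplified illustration under my own naming; real BIOS code works in clock ticks and day counts, not whole years):

```c
#include <stdint.h>

/* The RTC stores every field as binary coded decimal: the byte 0x98
 * means decimal 98. The BIOS must decode this before any date math. */
int bcd_to_bin(uint8_t bcd) {
    return (bcd >> 4) * 10 + (bcd & 0x0F);
}

/* Pre-2000 path: the year offset from 1980 is a single subtraction
 * from the two-digit year. Post-2000 path: the century byte has to
 * be folded in first - extra work of the kind that lengthens the
 * read past the 244 microsecond grace period after rollover. */
int years_since_1980(uint8_t century_bcd, uint8_t year_bcd) {
    int year = bcd_to_bin(century_bcd) * 100 + bcd_to_bin(year_bcd);
    return year - 1980;
}
```

For example, a century byte of BCD19 with year byte 0x98 decodes to 1998, i.e. 18 years past the epoch, while BCD20 with 0x00 decodes to 2000, via the longer path.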
Here is a snippet of an rtc.txt file from a machine that has TD (a 486 DX 33).
35 0 51 0 21 0 7 6 1 0 26 20
35 0 51 0 21 0 7 6 1 0 26 20
35 0 51 0 21 0 7 6 1 0 a6 20
35 0 51 0 21 0 7 6 1 0 a6 20
35 0 51 0 21 ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff ff ff ff ff ff ff a6 20
ff ff ff ff 21 0 7 6 1 0 26 20
36 0 51 0 21 0 7 6 1 0 26 20
The above machine has the effect and has demonstrated that fact many times.
The example below is from another machine that has the effect; notice the difference in the error.
39 52 31 35 15 23 0 5 4 98 26 19
39 52 31 35 15 23 0 5 4 98 26 19
39 52 31 35 15 23 0 5 4 98 a6 19
39 52 31 35 15 23 0 5 4 98 a6 19
39 52 31 35 15 23 0 5 4 98 a6 19
39 52 31 35 15 39 39 39 39 39 a6 19
39 39 39 39 39 39 39 39 39 39 a6 19
39 39 39 39 39 39 39 39 39 39 a6 19
39 39 39 39 39 39 39 39 39 39 a6 19
40 40 40 40 40 40 40 40 40 98 a6 19
40 52 31 35 15 23 0 5 4 98 a6 19
40 52 31 35 15 23 0 5 4 98 a6 19

Notice that the status on both machines turns high before the data goes bad, and that both are known to exhibit the effect. The effect happens only on startup. Also, the effect doesn't happen every time you start the computer. If your computer is susceptible to this effect it will not happen every time you start the computer, but on random starts.
General characteristics of the Bug. (The cause of the effect.)
RTC, buffered vs. non-buffered. The RTC in a PC is a battery powered clock that for most of its one second cycle reports the same thing to the user. We tested this device as a means of discovering what may be happening to cause this effect, and of making the effect show itself. We discovered that just before and while the RTC is changing from one second to the next, it displays a flag to tell the user "do not read from me now." During our testing a pattern came to light: those computers that showed no errors from the RTC while the update flag was high also showed no signs of the effect. We learned that these RTCs have a double register buffer, allowing them to update their time internally while not showing any of the errors that the non-buffered RTCs showed while updating. We decided a quick way to screen for this effect would be simply to check for this buffering, and to pass those computers that have a buffered RTC chip.
BIOS code, how it could be reading the RTC the wrong way.
Because computers that have buffered RTCs do not show this effect, because our testing has not had a single computer show this effect prior to the start of the year 2000, and because this effect seems always to occur at the start up of the computer, we concluded that the effect was in some way connected with how the RTC is read by the startup code. Further investigation showed that the BIOS code that reads these chips has three paths to follow, depending on the value of the Century Byte stored at register 0x32 of the CMOS memory. If the value is anything but BCD19 or BCD20, it is an error value, and the BIOS date is set to an error value such as "01.03.1980". If the value is BCD19, the code follows one logic path, and if it is BCD20 it follows a different path. This difference in logic path between BCD19 and BCD20 is the only difference in the code that the computer will follow in the whole start up sequence if all other things are left unchanged (as was done in our testing). This points directly to this change in logic path as the only possible difference that would allow this effect to happen. It is my theory that this difference in logic path changes the amount of time used to read the RTC enough that the code can still be reading the RTC while the RTC is in its update mode; if the RTC is not buffered, the value being read at that time is not reliable, and can cause the effect to occur.
5. How to test for it
The original test to prove the existence of the effect is to set the date/time of your computer beyond rollover to year 2000, and set a control clock to the same time/date. Then use the computer normally for 2 weeks, with the exception of assuring that the device is powered on/off once or twice a day by hand. Every time the device is restarted, compare its time/date to the time/date of the control clock, and log any differences. These differences are the signs of the effect. Also log any loss of device settings, and compare these to any normal such occurrences, if any; these may be the hardware problems that are associated with the effect. The long duration of this test is required because of the randomness of the effect: it may not occur on the first try. Also, the power on/off by hand is necessary, as no one who has tried to do this with a mechanical or software device has been able to produce the results; introducing the machine or software removes the randomness. In a PC most of the interrupts are tied to the clock tick timer interrupt, and this means any software that runs on these machines is not going to reboot the machine randomly at all. Also, power off can not be done by software, and having all of the drives spin down and the electrical components drain is more of a "real world" shut down for PCs, so that is the standard we have stayed with.
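The logging step of the procedure above can be sketched as a simple comparison against the control clock (the function name and the 60 second tolerance are my own assumptions, not part of the published procedure):

```c
#include <time.h>

/* Flag a candidate TD jump: at each power-on, compare the device's
 * reported time with the control clock, and report any divergence
 * beyond a small tolerance allowed for normal clock drift. */
int is_td_jump(time_t device, time_t control) {
    double drift = difftime(device, control);
    if (drift < 0)
        drift = -drift;          /* a jump can go either direction */
    return drift > 60.0;         /* illustrative 60 second tolerance */
}
```

Every restart where this returns true would go into the log; the 3 month jump in the 286 log above is a drift of millions of seconds, far past any sensible tolerance.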
We have developed the "TD Tool test Suite". The requirement for the test suite was a quick way to determine susceptibility to this effect (for the full requirements of the test see "Requirements for a test for the Crouch Echlin effect.doc", Echlin, June 1998). Also see our web page at http://www.intranet.ca/~mike.echlin/bestif/tdpro/ for test results and procedures from our testing. That page and the others on our web site are updated at random and irregular intervals.
6. How to solve it.
a. low tech workarounds
Forcing the user to input the date/time at every bootup, or reading the date/time across a network/intranet/internet from a computer that is known not to have this problem, is one way to deal with this. These of course will not fix the hardware side of the problem, and they leave your system open to the user just automatically accepting the date/time as displayed, or to the machine you are getting the time/date from being wrong for other reasons. Also, waiting until the computer is connected to the network leaves a fairly significant window for problems to occur - such as the accounting package starting, or the user saving files, while waiting for the network connection - especially with today's multitasking environments.
b. software fixes
The problem can be fixed with software, and we have a simple solution as part of the "TD Tool test Suite": tdfix.exe, a non-TSR program that runs once at startup and corrects any time and date errors before they can corrupt your data. There are problems with this approach too. The software solution does not directly address the hardware instability/loss-of-hardware problem, although in testing, none of the computers with this fix installed have shown the hardware problems since testing of the fix started. The fix must also be run in single-task mode, so on multitasking machines it must run in the part of startup that occurs before the multitasking environment is initialized; that is, it must be run from config.sys (or the Win9x/WinNT or UNIX equivalent), before the multitasking OS is loaded. And in an embedded system that is not based on a PC, any software fix will necessitate creating a new ROM chip.
c. hardware fixes
Hardware fixes can be made in one of two ways: replace the RTC with an RTC that is buffered, or replace the device with a device that has a buffered RTC. We have not observed the hardware instability/loss-of-hardware-configuration problem in devices with buffered RTCs. RTC replacement is only possible if the RTC is a "stand alone" RTC/CMOS chip. Many RTC/CMOS functions are now incorporated into VLSI or other integrated chipsets and super-I/O devices; in those cases changing the RTC is not possible, so replacing the motherboard with a board that has a buffered RTC, or the whole device with a device that has a buffered RTC, is your only hardware fix option.
7. Quick recap
a. the problem
The Crouch Echlin effect is a random jump in date and time that occurs at random boots of affected computers and embedded systems, only after rollover to the year 2000. It can also be accompanied by a loss of some hardware and CMOS settings. The systems that show the Crouch Echlin effect have one thing in common: a real time clock (RTC)/CMOS that, if accessed during the once-a-second update cycle, gives bad data to the BIOS POST date/time routine.
b. real & potential dangers
On computing devices, irregular random jumps in date and time can lead to accounting date errors, file date errors, wrong dates being inserted into data, historical data being lost, misdated, or scrambled, and plant synchronization loss, which could in turn lead to loss of plant control.
The only way to know if you have this problem is to test for it. You can use our 2 week test procedure, or check for the underlying causes of the effect: look to see whether your devices have an RTC that is not buffered, or use our "TD Tool test Suite" software to do the checks for you. No commercially available year 2000 test tool (as of this writing) checks for this problem; you have to run a separate test for this effect.
c. our solution
We have developed a solution for this problem, our "TD Tool test Suite", and we are continuing our research into this effect on all platforms.
d. further research
Our research to date has been primarily on Intel-based PCs/clones, although we have tested and observed the effect on other microcomputer-based platforms. Our research is currently focused on devices used in the production, distribution, and regulation of electrical power. We are also looking at the effect with respect to UNIX platforms. We are seeking partners to augment our funding and to provide equipment and systems to test.
"Best if used before Dec. 31, 1999" the Official Crouch Echlin effect website. http://www.intranet.ca/~mike.echlin/bestif/index.htm
"Procedures and results" page from the Crouch Echlin effect web site. http://www.intranet.ca/~mike.echlin/bestif/tdpro.htm
Jace's TD Page, http://www.nethawk.com/~jcrouch/dilation.htm
"Year 2000 Overview" http://www.intranet.ca/~mike.echlin/bestif/y2k_over.htm
The "Frequently Asked Questions" from the comp.software.year-2000 news group. http://www.computerpro.com/~phystad/csy2kfaq.html
Rick Cowles' "Electric Utilities and Y2K" http://www.euy2k.com
Jace's "How TD Occurs" page. http://www.nethawk.com/~jcrouch/second.htm
Good books on the inner workings of personal computers and how best to work with them:
The BIOS Companion - Phil Croucher (A must have.) ISBN 1-872498-12-4
DOS Programmer's Reference Guide - Dettmann and Johnson, QUE. ISBN 0-88022-790-7
Programmer's Problem Solver for the IBM PC, XT & AT - Jourdain, Brady. ISBN 0-89303-787-7
Inside the IBM PC and PS/2 - Norton, Brady. ISBN 0-13-467317-4
Systems Analysis & Design Methods - Whitten/Bentley/Barlow, Irwin. ISBN 0-256-07493-3
Assembly Language for the PC - Socha and Norton, Brady. ISBN 1-56686-016-4
Jace for daring to "just see what happens." Elizabeth for putting up with my spelling. Mark and Barry for recreating our results and proving we weren't crazy. The regul
-- Michael Taylor (firstname.lastname@example.org), November 02, 1998.
Well... you will have to go to the link to view the charts... they didn't work too well... but the text is pretty clear.
Just to clarify... I, Michael Taylor @ email@example.com originally posted a question on euy2k.com regarding the Crouch/Echlin effect and if this effect might impact utilities. One of the answers is what was used to start this thread. You can find both the original question and this answer on Rick Cowles website at euy2k.com. ====================================================================
-- Michael Taylor (firstname.lastname@example.org), November 02, 1998.
Michael, thanks for the post on the Crouch/Echlin effect. I followed the original reports as the drama unfolded and have in fact tested for this problem. We were not able to identify any PCs or PLCs with this problem. And as I recall from the evolution of this story, it took a long time and a lot of researchers to come up with a handful of systems that demonstrated the problem. I have to admit to having long since lost interest in Crouch/Echlin, but you've spurred a desire to follow up on the current information. Nevertheless, the mindless drivel that began this thread is still irresponsible and cannot be supported by the facts. Do you have the facts to support the claim that "...any process that involves more than 3 IBM PC/AT compatible boxes the chances are greater than 90% that the system will fail. Period."? This is crap, Michael; if it's not crap, then show me supporting evidence. Crouch/Echlin doesn't cut it. Maybe the lay folks were impressed, but you haven't submitted any proof for the ridiculous claims that began this thread. Y2K is serious enough without the hyperbole.
-- Woe Is Me (email@example.com), November 02, 1998.
You have to read exactly what was written, Woe. He says PC/AT, which means 80286 chips - possibly 80386. The AT was done away with when the EISA and Microchannel buses started duking it out and lost to PCI. So, taken on its own terms, it is perfectly correct. But I have never seen an embedded PC in a factory. Every designer I ever ran into puts the PC in its own box and connects with RS-485 or RS-422. This is so it is easy to upgrade or fix the PC. I suppose you might run into a few somewhere in the real world - but I don't know where.
-- Paul Davis (firstname.lastname@example.org), November 02, 1998.
I can't find any info. to confirm this so-called Crouch/Echlin effect on the web. The only sites I can find which support it are those of Crouch and Echlin themselves.
As a side note, it seems a bit presumptuous for someone to name an "effect" after themselves.
-- Buddy Y. (DC) (email@example.com), November 02, 1998.
Just as one more note in this disharmony, consider how much of the embedded "testing" has been done. Take Beavis and Butthead looking at embedded system X and doing the following: open the box/cover/interface/whatever, roll the date forward, watch a few iterations of whatever it does (open valve A; if pressure in pipe is equal to or higher than +limit+, open valve B, else close valve A). "Ahh, no problem here! Let's do one more and break for lunch." Little do they realize that the nuke cooling system control they just passed off as "no problem" will later result in the world's biggest nuclear accident. You see, the C-E effect takes lots of time to show itself! Days, weeks, maybe months will go by and then (no, not a date problem) an I/O problem. This thing can cause random loss of I/O because it trashes part of the BIOS. So doing a 5-minute check is virtually useless. I've seen a 386 do this!! It "lost" a serial port after about 2 weeks of wandering/expanding dates. All you O Woe Is Me disbelievers, get a handful of old systems and try it. Some will work and some won't. If you are halfway computer literate, you can do the tests yourself.
-- R. D..Herring (firstname.lastname@example.org), November 02, 1998.
If they've even tested that much. More likely, open the cabinet, look at it, "Nope, no computer in here...just these here controller boards, the ones they said were okay, remember? Well, put the cover back on. This one's alright."
-- Robert A. Cook, P.E. (Kennesaw, GA) (email@example.com), November 02, 1998.
RD - you just don't see 286s around any more. 386s are pretty rare too. Some 486s, then a lot of 120 MHz-plus Pentiums and later chips. Very few Pentiums have any problems - mostly confined to 90 MHz and below machines. And replacement boards for 486s should be Y2K compliant - with new BIOSes. The 286 and 386 are just worn out, though - I don't think they are anything to worry about.
I have assumed the Crouch effect is due to a mismatch in the RTC and the BIOS - that sometimes the clock ticks cancel each other out - sometimes they count double. Is the effect a random walk around the correct time or does it change in one direction once it starts to fall apart?
-- Paul Davis (firstname.lastname@example.org), November 02, 1998.
The reality of most embedded systems inventorying is that nobody does anything, other than contact the manufacturer and get a written statement regarding Y2K compliance. If indeed this Crouch/Echlin phenomenon was never known nor tested for by the manufacturer, then this is potentially very bad news indeed.
-- Jack (email@example.com), November 03, 1998.
Crouch is an old problem, generally, but certainly new as it relates to RTC and BIOS issues. Speaking as a veteran of far too many coffee ODing, all-night, head scratching, hair pulling, etc. incidents of trying to figure out why a particular bit of circuitry was doing something like Crouch, I'll say that almost always the cause of intermittent, "needle in a haystack" failures like this were the result of either supply voltage problems or signal timing problems.
I'm sure you know that the hardware keeps the time in the form of a binary number which relates to some arbitrary base date decided on at the time of design. Well, those numbers get larger with the passage of time, and as they get larger, the hardware takes longer to process them. The increase is extremely small, likely on the order of picoseconds, yet at some point the processing time gets long enough to matter. It's the straw that breaks the camel's back.
Crouch/Echlin has nothing to do with Y2K, per se, but as the number that represents the current time (century, year, month, hour, minute, etc. are all in the same binary number) gets larger, more and more hardware will be affected. For example, I have an old 8088 machine on my network right now that is still perfectly adequate for simple tasks and it has been exhibiting the effect for several years. I didn't know it was called Crouch/Echlin until very recently.
The "window of time" that the original hardware design allowed for the accessing of the RTC data is finite in length and at some point, depending on the size of that binary number AND the speed of the circuitry involved, we reach the edge. At that point, the circuitry attempts to use data which is "sometimes" invalid and to quote IBM, "results are unpredictable".
Electric current will travel about 18 inches through solid copper 30 gauge wire in about 1 nanosecond. The size and composition of the conductor will vary that amount of time. In this way, whether a land pattern in the circuitry involved goes clockwise around a VLSI chip or counterclockwise around a cluster of discrete chips or straight from point A to point B or is made of a slightly different alloy will make "logically" identical circuit boards perform differently with the same set of commands.
A "system clock" is really just that big binary number, as stored in a particular location in the system's memory. It initially gets the number from the RTC chip, but often never references the chip again until the system is restarted. Crouch/Echlin most often puts a bogus value in that memory location, but if the bogus number is screwy enough, it may overlay other memory locations, such as the ones that keep track of the com ports.
Obviously an incorrect time in the system clock will have the same effect on any given system whether or not it's 1900 instead of 2000 or 7843 instead of 1998. It doesn't make any difference where the inaccuracy came from, only whether that particular system "cares" what time it is.
As RD pointed out, the I/O corruption is an entirely different matter.
As to the presence of 286 or 386, etc. based systems, you're right on the money as far as the desktop universe goes, but not necessarily so in the embedded systems universe. If the baggage handling system at a medium sized airport, for example, is based on a 286 driven control system, and the task has not changed (increased amounts of baggage or different number of gates, etc.) it is unlikely that the system will have been upgraded. The type of processor is more likely determined by what Intel or Motorola was selling as state of the art at the time the system was designed.
The jury is still very much out on the embedded systems issue, and I honestly can't even guess how many of them will go nutzoid. I do think that the ones that fail will most likely just stop, rather than anything dramatic like crushing suitcases in an airport.
-- Hardliner (firstname.lastname@example.org), November 03, 1998.
Robert, your comment brought to mind a Dilbert comic strip that I read a few months ago.
The manager was asking the techie how the Y2K remediation was going so far.
"Really good", he replied. "I've been working on this for six months and haven't found a single date yet. As a matter of fact, all I've found so far are millions of 0's and 1's.
-- Craig (email@example.com), November 03, 1998.
Hardliner is correct about the embedded universe. I have a little experience with one manufacturing system. I installed a production tracking system (circa 1985) for a very large steelmaker at multiple plants and got involved with data from some process control units. The 286s were still being used in 1994 when some of the plants closed. I would agree that there are multiple ways that RTC/BIOS calcs can screw up, and usually the results aren't fatal or even noticeable. But consider the scale involved. Perhaps someone can give the exact answer, but I was under the impression that the heating and cooling systems of large buildings (skyscrapers, hospitals, the Pentagon) are often controlled by legacy controllers.
Lastly, I don't understand how anyone could attribute a wandering/expanding RTC to random voltage spikes. If the RTC functions normally with dates/times up to 12/31/1999 but exhibits TD beyond that, what's the logical link to voltage? The power supply has no idea what the RTC is doing. As I understand it, there is a century byte routine: 1900 is 0, 2000 is 1. The BIOS routines take longer when the byte is nonzero, and thus a register error can occur erratically.
-- R. D..Herring (firstname.lastname@example.org), November 03, 1998.
The link to voltage is not logical, but physical. There are multiple voltages at which a circuit will function reliably and multiple voltages at which it will not. The line between the two is invariably blurry and if the supply voltage is in that "twilight zone", the circuitry behaves erratically.
In the case of Crouch/Echlin, I would not expect voltage to have anything to do with it, since the results (corruption of memory) are consistent but infrequent.
I guess I confused the issue by bringing voltage into the picture at all, but my intent was to classify Crouch/Echlin as a signal timing problem, as opposed to a power supply effect.
-- Hardliner (email@example.com), November 03, 1998.
I don't know for sure - so count this only as a guess, please - but I'm very inclined to believe your answer: older plants still using 286/386 systems are very likely the rule, rather than the exception. I don't think many "shop floors" will be updated to the latest PC/controllers.
A manufacturing process is not really very "speed sensitive" once a controller is installed. For example, counting the cans going down a hopper, labeling Coke expiration dates, or weighing and recording fill times won't change very much. If the original PC/AT circuit board didn't fail, it won't get replaced. A faster one won't do more than the slower one - the cans aren't going by faster, the bottles aren't getting filled faster, the steel ribbon is still the same size it was ten years ago. So in fact, there are many $$$$ reasons for a factory to try NOT to change and upgrade over the years. Including the "ain't-broke-don't-fix" syndrome, but also "can't replace, so don't ever change anything."
Also to consider - think of the lead and design time to move from a new controller chip to a solid, running process in a factory or carpet plant or furniture mill (or wherever). These places won't buy things quickly; they will purchase slowly and demand proof of performance before purchasing. This slows down install times. From a new chip - to a new process design - to a new machine using that chip - to a new sale - to "build the machine" - then to install it may take several years. That is the development time to get to the shop floor _after_ a chip is produced. So the shop owner will want (will need) many years of production (payback of 5-15 years) from his new toy before he replaces it. Or even begins to consider replacing it.
Compounding the breakdown symptoms if things do fail in early 2000 - not all will fail at once - is the difficulty of finding the original electronics engineer and mechanic who first "tweaked" it into running at all. Those skills may be long gone in many places.
-- Robert A. Cook, P.E. (Kennesaw, GA) (firstname.lastname@example.org), November 03, 1998.
Oopsie! If this can happen now, uh, human error, what will the Y2K clean-up costs be?
Diesel Spilled In Sewer System
Oregon State University Will Pay $10K For Cleanup
CORVALLIS, Ore., Posted 9:19 a.m. December 31, 1998 -- Three hundred gallons of diesel oil spilled into the Corvallis city sewer system when employees at the Oregon State University heating plant left a drain valve open.
Workers at the wastewater treatment facility caught the oil before it did any damage to the treatment system.
The university will pay the $10,000 cost of cleaning up the fuel. The Department of Environmental Quality plans to issue the university a notice of noncompliance for leaving the valve open.
-- Leska (email@example.com), December 31, 1998.