Enterprise systems, real world (long, technical and scary)

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

I have taken the liberty of cross-posting from csy2k, because I think this is an important message and emphatically worth reading. It's quite technical, but I think even the non-technical people here will get the gist of it. To me, one report like this is worth 100 surveys by Taskforce 2000. Check it out
-----------
I am what Cory would describe as an Assembler gear-head. Big-iron, 20 years system software development, VTAM, NCP, blood'n guts machine code programming on three continents. And, believe it or not, before that I was a tax accountant.
For the cause (read; money), I've spent the last two years working on Y2K projects.
I feel like I know many of you here. My daily routine includes checking out Paul's informative, fascinating and often hilarious posts, and chuckling at BKS and Don Scott's predictable rebuttals. And Cory's commentary will one day make for a good play-by-play coverage of those crazy days leading up to the turn of the century when food was plentiful and rates were high. (Cory, forget the farm, get yourself stranded on Oahu for the duration. You like fruit and fish too, don't ya? - and no woodstove!)
For those of you who must know if I am a Pollyanna or a Doomsayer, let's get that out of the way right up front; I have no idea how bad it will be. Neither do you. My personal experience leads me to believe it will be catastrophic. But I am not an embedded systems engineer, neither do I know much about utilities other than what we have all read. Those of you who do know about those things probably know very little about my side of things.
What I do know about is large-enterprise business systems, and how dependent business, and thus, the global economies are on those systems. I also know how difficult it is to remediate those systems for Y2K.
(By the way, have you ever noticed how spell-checkers trip over the word 'remediate' yet we use it all the time?)
Let me describe a typical Y2K project from my experiences - I'd like to know if this is true elsewhere - let me know.
Let's start with a large government department - they have multiple mainframe legacy systems going back 20 years or so, as well as every platform-of-the-day application you can think of. They decide it's time to think about Y2K and after going through the usual bid process and a short pilot project, (read; after one year) they pick a consulting firm to take on responsibility for all aspects; legacy code, embedded systems, building systems, etc. The chosen vendor has a solid reputation as a supplier of system migration consulting services, specializing in COBOL.
So, it's spring 1998, the contract is signed, and the project is underway. That is, as soon as the vendor can hire the necessary skills. (I'm not kidding.)
The main requirement is for MVS COBOL skills. Well, just COBOL will do - it's virtually the same on any platform right? So they get a headhunter on the job and go shopping for programmers. Now keep in mind that by this time, Cory has already started ranting about rates, and good legacy skills are almost nowhere to be found. (Calm down, I said **good** legacy skills.)
This is where I come in. Fortunately for them, I had just arrived back in town after doing some contract work in Europe and the USA, and was tired of being on the road (I live in Canada). So what the hell, I needed a rest and some time with my family so I signed on as their first employee on the project. I would be the senior MVS guy who could provide guidance to those less experienced.
On my first day, I was introduced to the other two hirelings, a nice lady who had come out of retirement to refresh her COBOL from 15 years ago, and a fellow from the Middle East who was still struggling with English. He also knew COBOL but wasn't sure what MVS was all about.
The initial task at hand was to learn how to use an ISPF-based scanning tool to perform impact analysis, provide line-counts, and identify the potential trouble-spots in the COBOL code. Sounds simple enough right? I said ISPF-based. (You hit PF3 to get out. No, now you've split the screen, hit PF2 again. No, now we're back at the primary option menu, you must have hit PF4. No, I'll teach you about split screens another day. AARGH! OK, that screen submits the scanning job. A job? Well, that's how MVS performs batch processing - JCL stands for Job Control Language - What? No, you can't write JCL in COBOL...)
By some miracle, we have actually fixed and tested a number of systems and gotten them back into production. Will they function after 31/12/99? I hope so, but I wouldn't bet the farm on it.
Our crew today consists of the two I mentioned, half of South Africa, India, Taiwan, and the old folk's home down the street. I think we have a couple of Canadians too. Don't get me wrong, most of these are good people who are working very hard on an operating system they know little about in a language they do not yet speak or understand clearly. My point is that considering the urgency of the project, and the uncertainty regarding how long it will take, perhaps it should have been staffed up differently. (But that would have been expensive!)
Six months into the project, I decided I had had enough so I resigned. I got a call at 4:00am from the owner of the company who was vacationing in Florida. Please don't go. The project is over if you go. We can't replace you.
Hmmm, here's financial opportunity if I ever saw one. I stayed. As a subcontractor.
It turns out that staffing was the least of our problems. Trying to get the application areas to let go of the code and let us fix it is darn near impossible. We are still screaming at them to let us work on mission critical stuff. "But it's not available yet - we have too many outstanding urgent service requests from the users."
At the moment, we are working on an ancient, mission-critical COBOL financial application consisting of about 500,000 lines of code with absolutely no documentation. Anyone who ever knew anything about the application has long fled the government for jobs that pay a fair wage.
That's the batch side of it. The online side is written in ADF. The IMS database segments are hard-coded everywhere, and to prevent having to restructure the database, they have been redefining parts of segments with overlays for the past 20 years. Those parts of the overlay not referenced by a particular program are defined therein as fillers, but may contain serious data. The batch cycle starts with a database extract to flat files, containing multiple record layouts which are referenced in the code something like:
    01  INPUT-REC.
        05  FILLER                PIC X(86).
        05  FIELD-I-WANT-TO-USE   PIC X(2).
        05  FILLER                PIC X(327).
Of course the fillers contain all kinds of 6-digit dates and the Year 2000 "tool" in use cannot identify them.
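The blind spot described here can be sketched in a few lines. What follows is a minimal Python sketch, not the scanning tool from the post: it walks the INPUT-REC layout shown above and flags any six-digit run inside a FILLER that could be read as YYMMDD. The function names and the YYMMDD heuristic are assumptions made for illustration; a real scan would need the date formats actually in use and would throw false positives on ordinary numeric data.

```python
# Layout mirrors the INPUT-REC example: (name, length) pairs, 415 bytes in all.
LAYOUT = [("FILLER", 86), ("FIELD-I-WANT-TO-USE", 2), ("FILLER", 327)]


def looks_like_yymmdd(text):
    """Heuristic (assumed): six ASCII digits with a plausible month and day."""
    if len(text) != 6 or not (text.isascii() and text.isdigit()):
        return False
    month, day = int(text[2:4]), int(text[4:6])
    return 1 <= month <= 12 and 1 <= day <= 31


def filler_date_candidates(record, layout=LAYOUT):
    """Return (offset, text) for each date-like 6-digit run inside a FILLER."""
    hits = []
    offset = 0
    for name, length in layout:
        if name == "FILLER":
            chunk = record[offset:offset + length]
            # Slide a 6-character window across the unnamed bytes.
            for i in range(len(chunk) - 5):
                window = chunk[i:i + 6]
                if looks_like_yymmdd(window):
                    hits.append((offset + i, window))
        offset += length
    return hits
```

A tool that only looks at named fields never sees these bytes at all, which is why a scan driven by data names alone comes back clean on a record like this.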
Are we going to find them all? You be the judge. My original estimate for completion of this particular application was 31/10/98, then 31/01/99. I will no longer give a firm date. The fiscal year-end is 31/03/99 at which time it **will** fail. Joanne is right.
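The post does not say how the application fails at fiscal year-end. One common failure mode, assumed here purely for illustration, is that a fiscal year opening on 1 April 1999 closes on 31 March 2000, so two-digit-year comparisons start crossing the century long before 31/12/99. A minimal Python sketch of that comparison, with hypothetical function names:

```python
def period_already_closed_yymmdd(period_end, today):
    """Naive legacy-style check: compare YYMMDD strings directly."""
    return period_end < today


def period_already_closed_yyyymmdd(period_end, today):
    """Same check with 4-digit years, which order correctly as strings."""
    return period_end < today


# On 1 April 1999, a fiscal year ending 31 March 2000 should still be open.
wrong = period_already_closed_yymmdd("000331", "990401")        # "00" sorts before "99"
right = period_already_closed_yyyymmdd("20000331", "19990401")
```

Here `wrong` is True (the period looks closed before it has begun) while `right` is False.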
This post rambles a lot and for that I apologize. But this is therapy for me (twitch-twitch)- so read on or move on.
Perhaps I am involved in a particularly whacko project, but what concerns me is that in most other aspects, this shop is typical of all those I have worked in around the world. So I assume their Y2K project is also typical, and if that's true, we are in trouble.
I was involved in the start-up of another Y2K project for a large multi-national bank in London. I can't identify them but their initials are CITIBANK. I was supposed to be their main Assembler guy but I ended up setting up CICS 4.1 regions for them for their 6 European 'branches'.
I did some initial assessment of their assembler code and it was ugly. Very ugly. Unfortunately, I was forced to cut the project short and come home due to a death in the family so I didn't see the project through. In all fairness, perhaps they have since completed their project successfully. But I didn't come away with a good feeling. They were having a hell of a time finding skilled people. They had to go to Canada to get me!
You're going to love this one. Around 1990, I was involved in the development of a massive EDI client-server communications system in Seoul, South Korea. The server is an MVS mainframe, which deals with all of their suppliers and customers across an SNA/NPSI/X.25 connection to primitive PC apps at remote sites. The underlying messaging protocol is X.400.
The X.400 code was originally written in C for UNIX. We took that code and ported it to MVS using the Waterloo C compiler. Instead of using CICS or IMS, it was decided (not by me) to use a home-grown TP-monitor to handle all transaction processing and SNA communications. Now this platform was a monster, written entirely in Assembler/370 out of necessity, it was understood only by those who had developed it in the first place. You know where I'm going with this.
The company that developed the system is no longer in business. I went back to Seoul independently in 1995 with a friend and overhauled the system from top to bottom for performance reasons, but Y2K was not an issue for the client at the time.
The original UNIX code, as well as the transaction processor, are not Y2K compliant and will cease to function at the end of this year. I don't know if the system is still in use, I presume it is. It is a proprietary system, there are no shrink-wrapped replacements they can put in its place. They don't have the source (I do.) Their economy is in a mess. I haven't heard from them.
They are the 3rd largest steel producer in the free world.
So, enough said. I don't know if it's TEOTWAWKI, but based on my view of things, I think we are in for an economic event such as the world has never seen. Perhaps that will trigger Infomagic's spiral, perhaps it won't. Either way it won't be pretty.
I can't do the Milne thing. My children are grown and no-one will come. Fortunately, I live in an area of mild climate. I am stocking up on as much as I can to cover the extended family for as long as possible. I'll know how long when I can no longer buy supplies.
As for Pollyannas and Doomsayers? I can understand the motives of most Doomsayers - if you believe Y2K will be bad there is perhaps a moral obligation to warn others in as loud a voice as you can muster - whether or not you end up being mistaken. What motivates the Pollyannas? I have no idea. If you think Y2K is nonsense, why are you wasting so much energy proclaiming it? Ignore it and it will go away soon enough.

-- Flint (flintc@mindspring.com), January 22, 1999

Answers

Most of the report went over this layman's head, but one section did strike me as being very similar to Alan Greenspan's testimony before Congress (which I believe is available through links on Sen. Bennett's site) in which Greenspan recounted his days as a programmer and the fact that he (Greenspan) would have great difficulty in going back now and remediating code that *he* wrote in the 1970's because proper documentation no longer exists.

-- Puddintame (dit@dot.com), January 22, 1999.

Thank you for this one, Flint. I gave up monitoring csy2k long ago.

-- No Spam Please (anon@ymous.com), January 22, 1999.

Flint, why do you still have the source?
MoVe Immediate

-- MVI (vtoc@aol.com), January 22, 1999.

MVI: it's a repost from c.s.y2k gearhead (flint is mr. embed. sys.)
Flint: a good pull. This is what I see also. For instance, managers who do not understand that when a senior programmer leaves, they're not just losing a good coder, they're losing part of the user's manual to their software. All those little things in the code that someone somewhere used to know so well now have to be painstakingly relearned one at a time. It's comparable to someone coming out of a coma and having to learn how to talk again. Most managers I have dealt with have the technical inclination of a turnip seed.

-- a (a@a.a), January 22, 1999.

Thanks Flint.
This guy reminds me of myself. I have also worked around the world, UK, France, Germany, Saudi Arabia, East/middle/west coast of US - from what I have seen on those projects those companies are at a very real risk of being burnt toast next January. Things are better in the US and the UK, but not much better.

-- Andy (2000EOD@prodigy.net), January 22, 1999.

If you got this far and your sides don't hurt from the laughter, and you aren't revamping your shopping by X2, you don't understand the tech background for Y2K. Flint, this is one of the worst horror stories I've had the pleasure/pain to read. This guy knows where some bodies are buried, and the owners don't even know they've died!
Chuck

-- Chuck, night driver (rienzoo@en.com), January 22, 1999.

I have to agree with Chuck- I sat here bursting out laughing (while not almost crying internally) at some points in this post. Sure has the ring of truth to me.

-- Drew Parkhill/CBN News (y2k@cbn.org), January 23, 1999.

Wow. I need to be checking the NG more frequently. This is the exact type of post I've been looking for!
Code on!
I'm a-codin' too...gotta get that woodstove installed soon...

-- Delete (del@dos.com), January 23, 1999.

I especially liked Cory's response to Toast;
**You have pushed me another notch over to the dark side. I'm sorry but your uncertainty makes it that much more certain.
Your uncertainty confirms much of my sense of this. There is a wall at December 31, 1999. On the other side of the wall, there be dragons...
Toast, I would like to rerun an expanded version of this article in a future WRP. This is the scariest one I've seen in a while.
cory hamasaki 345 Days, 8,283 Hours
more Y2K stuff at http://www.kiyoinc.com/current.html**

-- c (c@c.c), January 23, 1999.

Flint --- "Don't we have Microsoft and Intel? ..... quoth Al Gored?"
Anyone who knows systems knows that this is worth 500 industry reports on "compliance percentages." EXACTLY the way things are and always have been. But I'm not gleeful: Infomagic, here we come?

-- BigDog (BigDog@duffer.com), January 23, 1999.

I want my mommy!!!!!!!

-- Sheila (sross@bconnex.net), January 23, 1999.

Big Dog,
Might this be a rosetta key to understanding the constraints of remediation?
Works for me...
~C~

-- Critt Jarvis (Wilmington, NC) (critt@critt.com), January 23, 1999.

Critt --- do you mean this post, "Al Gored" or whaaaaa?

-- BigDog (BigDog@duffer.com), January 23, 1999.

This is an authentic view. The reliance on relatively cheap foreign 'technical help' (and I use that term very loosely) is virtually criminal. Many of the enterprise level systems will be so much smoking wreckage after "remediation".

-- RD. ->H (drherr@erols.com), January 23, 1999.

Yikes!!!!!!!!!!!!!
Sure makes me feel a lot better about all that money my wife and I have been spending lately.
Thanks for the info.
--Jim the window washer

-- The Window Washer (Micaiah@2kgs.bbl), January 23, 1999.

Critt --- Sorry there, I wasn't firing on all cylinders. With due respect to Cory, this is the best (= the worst) piece I have ever read on what is really going on. I'd bet a million bucks on the bona fides of the author, whoever he is. This isn't faked, boys and girls.
Yes, it is a kind of rosetta key, taking the post in order:
... sincere but hapless efforts by outsourced ethnic groups, demonstrating management's low estimate of Y2K's significance. This is not inconsistent with executive management showing some interest: tip-top mgmt does not always get its way with IT mgmt, contrary to what one might think, and often brings in consultants hoping to frighten, goose or otherwise stimulate the blood flow down in the shop.
... resistance to letting the remediation group actually touch the sacred mission-critical code. See my point above. The outsourced group has no inherent authority to work on the "real stuff."
... huge applications with zero documentation.
... hilariously stupid hacks (cf "fillers") that "made sense" at the time but can now bring down entire enterprises.
... home grown TP monitors. Hey, it gave coders something to do and, psst, might keep anyone from ever firing them. Oops, they left anyway.
I have posted earlier this week on the lunacy and scam of compliance percentages. Yes. The other piece of the Y2K scam, worth another post if I can once again keep from barfing, is the Grand Canyon divide between remediating enterprise systems and PC-style applications.
The Y2K effort would have been more rational if we had:
1. Identified the enterprise systems that run the world (2,500?).
2. Remediated them, starting in 1995 AT THE LATEST.
3. Ignored all else (well, not really, we should have put the same effort into embedded systems, also beginning in 1995).
My thesis: while the result would still stink, we would be in better shape than having these 2,000 systems 40% of the way there and all other so-called mission-critical systems 40% of the way there.
40% + 40% = 0 in this case (sorry, Belasco). And excuse me for resorting to meaningless percentages.
BTW, would love to know whether the author of this piece would agree.
So, yeah Critt, this is a rosetta stone all right.

-- BigDog (BigDog@duffer.com), January 23, 1999.

At the other end, my experience in the embedded world is that small embedded systems are often done by a single engineer, whose specialty is hardware (he's NOT a programmer). But it's a microcontroller, it only has 30-40 instructions in the set, how hard can it be?
Perhaps Franklin or Wildweasel have different experiences. But I've often enough been called in to repair the Gawdawful spaghetti some hardware jock crammed into 2-4K of ROM. To the designer, the schematic *was* the documentation. The source is derived using a binary dump from a ROM reader, and an opcode map. Gee, is this a date? It *seems* to be getting updated from an interrupt at regular intervals...

-- Flint (flintc@mindspring.com), January 24, 1999.

Flint --- there is no good data, obviously, but what is your opinion on:
1) the date exposure of embedded systems that "don't need to" calculate dates vs
2) the date exposure of embedded systems that do
or do we need to go down to the chip level of a particular embedded system to answer this question? Does the question even matter with respect to a) fixing and b) consequences?
I am completely ignorant re these devices personally except for research.

-- BigDog (BigDog@duffer.com), January 24, 1999.

Big Dog:
Hate to say it, but I think the only way to answer your question is through the very research you've been doing. I do know that more EE types are employed doing embedded systems in the US than in all other areas combined. Most of them tend to be one-offs, replaced by more powerful, smaller, cheaper one-offs.
For actual chip replacement to be feasible, that type chip must be still available in the same form factor, and of course be a compliant version. This applies to PROMs mostly. Understand that even if the date comes from a 2-digit-year RTC (vast majority) this doesn't necessarily mean there is any compliance problem. Newer PC BIOS still deals with this issue (indeed, if the RTC has century in hardware, I have yet to see a PC BIOS that pays any attention to that hardware register. Not backward compatible, you know. No standards).
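One way software sitting on top of a 2-digit-year RTC can stay out of trouble is a fixed window, where a pivot value decides which century a two-digit year belongs to. The sketch below is a minimal Python illustration of that general technique, not a description of any particular BIOS or device; the pivot of 80 and the function name are assumptions chosen for the example.

```python
PIVOT = 80  # hypothetical pivot: 80-99 map to 1980-1999, 00-79 to 2000-2079


def expand_year(two_digit_year, pivot=PIVOT):
    """Expand a 2-digit year read from a clock into a 4-digit year."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a year in the range 0..99")
    century = 1900 if two_digit_year >= pivot else 2000
    return century + two_digit_year
```

With this window, `expand_year(99)` gives 1999 and `expand_year(0)` gives 2000. The window only defers the problem to the pivot year, and it does nothing for code elsewhere that compares or subtracts the raw two-digit values.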
The noncompliant embedded systems I'm sufficiently familiar with to speak in any detail, are on manufacturing assembly lines. The worst cases are when you have many robots connected to embedded servers, which are in turn connected to a traditional computer (micro or mini). Often you have two (bad) choices -- replace PROMs or PLCs in each robot with compliant versions not available, or rewrite the code in one level or server or another to handle next century's dates properly. Rewriting this code is the task of the vendor, who must install the update in more installations than he can get to, once the compliant version is actually written.
In some cases, the fastest way to fix it is to replace *everything* with the latest and greatest. This can be incredibly expensive, since at worst it involves reorganizing the process, moving the lines around, retraining the people as well as replacing the offending hardware. But I'm sure you know all this.
My take so far is that fixes at this extreme end are fortunately very rarely required. In most utilities (can't quantify, and it depends on functional installation) the error rate of *systems* varies from 5 to 40%. Of these, reasonable approaches vary from ignoring the noncompliance (cosmetic) which is common, to turning back the clocks and reinitializing any historical data, also pretty common, to powering down just before midnight and back up afterwards (works for boundary condition issues where a single reading or short-term series of readings causes the problem, so don't do that reading or series).
Of course, the problem is that until you test, you don't know what flavor of remediation you'll be facing (if any). And to test (especially the crucial end-to-end testing) you need to shut down the system. For 24x7 operations with long and expensive shutdown and restart procedures, this amounts to destroying the village in order to test it.
All in all so far, my gut feeling is that there won't be too many embedded problems above the minor or annoying level. But there will definitely be some (I'd guess thousands worldwide) that will be really spectacular, requiring the abandonment or rebuilding of the entire plant (and dealing with the collateral damage of the explosions, escaped gases, severe pollution, etc.)
I'd appreciate any input here from anyone whose range of experience in embeddeds is different. I'd trust the judgment of Franklin or WW more than I'd trust my own...

-- Flint (flintc@mindspring.com), January 24, 1999.

Flint (and Franklin/WW) --- we're drifting a bit off-thread (my fault) but the general subject is, uh, Y2K related, eh?
Flint, let's grant your points for argument. Assuming that most chip remediation is nuisance-type, but also assuming that, pre-Y2K, there have rarely (never?) been end-to-end embedded system threats, how confident can engineers be in their gut feeling about critical breakdowns? Put another way, since end-to-end testing ain't gonna be done for reasons you explain, how useful is the spot checking that is going on, except, admittedly, it indicates that the "spots" are relatively minor, TAKEN BY THEMSELVES? Or, put still another way (sorry), are the spot problems serial in nature or, at times, combinatorial but not likely to be seen as such till the end-to-end execution takes place 1/1/2000 or thereabouts?
Related to this (and gut feeling fine, pls elaborate): 1,000s of failures of the scope you seem to be suggesting is trivial as a percentage of systems but enormous in its global impact:
Chernobyl(s)? Bhopal(s)? Oil rig/refinery collapse(s)? etc .....
Is that the scope you are describing? If so, how do you personally rate the contribution of Y2K software problems compared to the contribution of embedded system problems to what is upcoming?
And don't apologize for the intuition part. THAT'S OBVIOUS.

-- BigDog (BigDog@duffer.com), January 24, 1999.

Beautiful post - thank you for the ugly details.
Back up two steps - and think through the data flow at say the unnamed bank with Citibank initials and the Korean steel plant - what will happen if/when their code runs?
That is, what is the result of the un-remediated code in these cases? (Spell checkers don't like that one either!) If they have power, and raw materials, and customers, and natural gas, coke, and iron ore and shipping, and trains and payrolls and sales to _want_ to continue in business, what do you think will happen when these programs run? Could the company stay in business with an unrepaired program? How long? What would be product quality and timeliness?
Wrong material, wrong ingredients in the supplies, wrong mix temperatures or heat or cooling period and your alloy steel is a death trap when used in construction or shipping or railroad tracks; or a consumer rust bucket if used in canned food or cars!

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), January 24, 1999.

Here's my $0.02 (US).
At this moment I am tending to agree with Flint that the embedded systems exposure in the power industry seems to be smaller and more easily remedied than first thought. I have greater hopes that we will not have a national grid collapse. But it's the stuff that you don't think about that bites you really badly. And since there is no way to do a system-wide test, we still just have to close our eyes and pray really hard at the roll-over. This says nothing about the exposure in the oil and gas industry -- I'm more concerned about those right now than electricity.
And I still hear those words "catastrophic failures" from a GM representative echo through my head. Their assembly lines stopped cold and from what I hear the remediation is non-trivial. (I gotta go ask the assembly plant guys at my company whether they are aware of this!!!).
So what happens to companies like Intel that have decided to "fix on failure" because they can't shut down for remediation? ISTM that the possibility of several major Fortune 500 companies biting the dust from assembly line malfunctions is still very high.
What will happen to the world economy if Intel and a couple of other semiconductor manufacturers buy the farm or even just stop shipping product for three months?
WRT remediating embedded systems, I agree with Flint. I have worked on programs that were so unstable that they broke down if you looked at them cross-eyed. Any change, even the seemingly trivial, sent a wave of unexpected side effects through the system. And this was not a lot of code: 10,000 lines of assembler burned into a 12K 68HC11. (My current work is on 32 bit Motorola processors with C++ and state-of-the-art CASE tools. An 8-bit processor would handle our applications, but it has been decided here -- rightly, I think -- that software documentation, stability, reuse, and robustness (is that a word?) are more important than saving $15-20 per controller).
So I agree with Flint that it seems that actual reprogramming of the various embedded systems -- at least in the power industry -- is not often required. But when it is required, we can expect far worse success rates than with mainframe and PC applications. Too much embedded systems stuff is still black magic and when the high priest of a particular project leaves it's all over. Better to scrap the system and start from scratch. And with 11 months left, that is simply not an option -- any system in that position right now is certain to fail.

-- Franklin Journier (ready4y2k@yahoo.com), January 25, 1999.

A real-world example of one factory:
44 automated production systems
24 completed in 6 months
10 to be replaced with newly purchased systems (3/1999 delivery)
10 with unknown completion date (1999 only)
Company started January, 1998
Best scenario = 16 months (1/1998 - 4/1999)
Other company factories with similar systems haven't started yet.
Get the picture?
I'm currently in Thailand and will post some information next week when I return to Japan about the situation here. Good news and bad news.
The dichotomy of Asia: I can't drink the water, but my internet connection is 800 kbits/sec via satellite link.

-- PNG (png@gol.com), January 26, 1999.

"but my internet connection is 800 kbits/sec via satellite link..."
Yeah but how much is that link, and, um, how was Thailand???
later, Andy :)

-- Andy (2000EOD@prodigy.net), January 26, 1999.
