Hamasaki: Why Y2K won't be fixed in 2 or 3 hours (Once more, with feeling)


Subject: Re: critique my 'Millenium Y2K' beginning please. A.H. damn it, be kind :)
Date: 1999/07/09
Author: cory hamasaki <kiyoinc@ibm.XOUT.net>


On Fri, 9 Jul 1999 06:35:50, fixxit@bright.net (steve) wrote:
 
> Henry Ahlgrim <doitnow@gol.com> was, like, :
>
> <<<snippado>>>
> ->appreciated.  My mainframe experience says 7-10 failures a week in a
> ->WELL-RUN shop, but most are not severe enough to reach the news media.
>
> HOLY SHIT!! Please, someone tell me he's wrong about this! Cory? 7-10
> failures a WEEK?? And some of you guys are saying everything will be
> ok??
 
It depends upon what Henry means by failures. 
 
If you sit on the JES console of a big production MVS system, you'll see regular abends, 0Cx, 622, 122, CICS region failures, and the like roll by.  Waiting in the wings, there's a crew of production control analysts who review the output, make fixes, and sometimes call the programmer on beeper duty (an Arnold type: "I'll be back.").  In most cases, the problems are related to unusual data records making it past the edit filters or to scheduling errors upstream from the failing job.
 
Sometimes the cause is environmental, a problem with the operating system, or how operations choose to configure the system at that moment. Examples are a shortage of SYSDA or having a required pack off-line and replying "Cancel" to the mount request.
 
These kinds of problems are the ones that lead our clueless, lightweight manager types (I'm not calling Henry that, but there are people here who fit that description.  Shhhhush, they might protest and in doing so identify themselves as clueless.) to claim that "production problems can be fixed in, oh, 2 or 3 hours, we do it all the time."
 
Well... no, you don't.  You fix a minor edit, a disk space shortage, jobs run out of sequence, a missing input file, an equipment check on the output tape drive, a CPU failure during a database sort, and so forth.  These problems happen all the time.  Restore and rerun, submit the "reset job" and restart, look at this output and type these control cards, back up the system a day, and we'll be caught up by Saturday.
 
You don't repair fundamental design constraints in the system in, oh, 2 or 3 hours.  These problems take days, weeks, months to fix.  In some cases, the enterprise will fail, cut off from revenues, angry
clients bearing torches, before the system can be recreated.
 
As fast as MVS boxes are compared to PeeCee's, it still takes time to rerun, restart, do the processing.  These 7-10 failures can affect production schedules.  In rare cases, the delays are long enough to be visible to the outside world.
 
This is not the Y2K problem. Don't confuse the two.  I've cited a number of IT failures, Oxford Insurance was one, where IT problems have persisted for months or years.  Fundamental design constraints cannot be fixed in, oh, 2 or 3 hours.
 
This is where our Polly-Pals are way, way out of their league.
 
The minor flaws in production systems are designed, sometimes by people, sometimes by a kind of natural law of survival of the fittest, to be recoverable in, oh, 2 or 3 hours.  In part that's because these kinds of small problems happen all the time.
 
A mainframe shop will spend millions to buy the hardware to complete processing in a batch window.  After about 10 at night, the system will typically idle down; a few backups will be running, maybe a database reorg or two, but the peak load is over.  The hours between midnight and 6 AM are usually near zero percent utilization.
 
In a sense, that time is there to allow for frantic repairs of
production failures, restarts, reruns; that's where the fiction of fix on failure in, oh, 2 or 3 hours came from.
 
Here's another analogy.  Household fires and false alarms happen with about the same regularity.  A barbecue sets some bushes on fire and engine company 6 is rolling.  So the fire-polly says, hey, all fires are extinguished in, oh, 2 or 3 hours.
 
Well, no.  We're about to see a world-wide collection of cities burn like Chicago after Mrs. O'Leary's cow, San Francisco after the
earthquake, Tokyo and Dresden after the firebombing.
 
And cities are full of fire houses where firefighters watch videos, rest, play volleyball, and idle the hours away.  This idle time is an analogue to the midnight-to-6-AM idle time in mainframe centers, as well as the wasted cycles at Sungard and the other recovery centers.  It's designed in because someday you will desperately need those resources.
 
Or not; how lucky do you feel today?
 
Here's the proof that fix on failure will not work.  Numerous companies have reported spending years and hundreds of millions of dollars painstakingly fixing Y2K problems.  Why didn't they just come in on a weekend (the 4th of July would have been an obvious choice), roll their clocks forward, take the failures, and fix them in, oh, 2 or 3 hours?
 
They didn't because although this would expose many, perhaps most, of the Y2K problems, they know they can't fix these problems in, oh, 2 or 3 hours.  Fixing one incarnation of a date problem could take a few minutes or it could take months.  Fixing all the problems will take years.
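 
To make that concrete, here's a toy sketch of a single "incarnation" and its few-minute fix, written in Python with invented names rather than the decades-old COBOL where the real ones live.  The years go into finding the thousands of other spots that bake in the same two-digit assumption, not into any one patch:
 
    # Account age computed the pre-remediation way: two-digit years.
    def years_on_file(opened_yy, today_yy):
        return today_yy - opened_yy

    print(years_on_file(opened_yy=95, today_yy=99))  # 4 under a 1999 clock -- looks fine

    # Roll the clock forward to 2000 and the same code answers -95;
    # downstream jobs abend on it, or worse, keep running with it.
    print(years_on_file(opened_yy=95, today_yy=0))   # -95

    # Fixing THIS occurrence is minutes of work, e.g. a windowing rule:
    def years_on_file_windowed(opened_yy, today_yy, pivot=50):
        expand = lambda yy: (2000 + yy) if yy < pivot else (1900 + yy)
        return expand(today_yy) - expand(opened_yy)

    print(years_on_file_windowed(95, 0))             # 5 -- correct again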
 
Finally, the myth of "all systems run with flaws" should be put to rest.  The truth is that the flaws production systems run with are, almost by definition, trivial, non-fatal, and rare.  Otherwise they would not pass QA.
 
In 175 days, there will be a new world order as far as software system flaws go.  These will take days, weeks, months to repair.  The rules of the game will change, and companies will fail.
 
For me, the question is how many companies will fail outright and how long will it take for them to realize that they are the walking dead?
 
Is it five of the Fortune 500, all of them, or something in between? What if your employer is one of the failing companies and your life savings are invested in a couple of the others?
 
> ->Another factor to consider is that news media will be waiting and
> ->watching to trumpet ANY failures that week, even the background failures
> ->that would not be newsworthy any other week. (As they were watching on
> ->1-1-99, 4-1-99 and 7-1-99).
>
> It seems to my (uninformed) mind that any y2k failures will be *in
> addition to* these background failures (7-10 a WEEK??), and
> programmers could quickly get overwhelmed trying to fix everything.
 
The 7-10 don't count.  Those are just car fires.  The problem is not just larger; it's of a different class.  It will be like city-size burnouts.  (I'm not expecting cities to literally burn.  I do expect some city-size enterprises to experience IT problems that drive them out of business.)
 
The Y2K problems will be far worse than our Broomie-buddies want to face up to.  And as Info says, I could be wrong; it could be worse.
 
cory hamasaki http://www.kiyoinc.com/current.html
I've added an open listserv to current.html.




-- a (a@a.a), July 09, 1999

Answers

You want another vote? I've been reading Cory since I can't remember when. I've never seen a thing he said that didn't make sense. This is all solid; this is how systems run.

As I said on another thread, some stuff IS going to get fixed in a few hours, or a few days. But stuff that's designed in, like 2-digit years in indexes, is going to kill companies. The Big Question, the Big Acceptance Test, is "can we fix it before we die?" Nobody knows.
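
Here's a rough picture of what "designed in" means, using a Python dictionary and made-up data where a real shop would have a VSAM key or a sort control card:

    # Transaction history keyed on YYMMDD, the way files were laid out for
    # thirty years.  Data and field names are invented for illustration.
    history = {
        "970310": "address change",
        "990822": "rate adjustment",
        "000115": "payment received",   # January 2000, keyed as "00"
    }

    # "Latest transaction" logic that has always worked:
    latest = max(history)               # string sort: "99..." beats "00..."
    print(latest, history[latest])      # 990822 rate adjustment -- wrong answer

    # There's no 2-or-3-hour patch for this: every file, index, and sort that
    # shares the key layout has to be converted together, along with every
    # program that reads it.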

A lot of this we-can-do-it stuff comes from people like an ex-boss of mine. He'd read some damn book on computers, come in full of exciting ideas, and we'd have to explain again how the world worked. Some comes from CS people who never saw the real world - the same ones who say there can't be a problem with 99 rolling to 00, because a byte actually can hold a value up to 255!

You gotta put in some console hours before you really hear how the big wheels turn.

-- bw (home@puget.sound), July 09, 1999.


Y2K's world-wide impact will be systemic, with multiple, simultaneous, parallel flaws of unknown dimensions and unforeseeable practical consequences. Moreover, because of the lack of leadership awareness and vision, managerial, budget, and human-resource constraints, and wishy-washy testing (if any), simultaneous small, medium, and large flaws are to be expected throughout, including showstoppers. The definition of "critical systems" and the role of the remaining "non-critical systems" is something that y2k historians will have to dwell upon after the fact.

At any rate, Fix On Failure cannot cope with this scenario, simply because these concurrent negative events cannot be foreseen or provided for logistically, whether in personnel, spare parts and equipment, raw materials, inputs, etc. This includes hardware, software, firmware, and embedded systems, plus many other operational items.

Cross-contamination, the domino effect, and the impossibility of isolating a group of network components (for example, failing banks or key JIT vendors) without disrupting the very nature of the network itself are just some of the additional problems that y2k management will face. As the Fed President has wisely stated, "99% compliance is not enough, 100% is the minimum," referring to the international banking system composed of more than 200,000 banks.

Furthermore, to make things worse, there are three y2k problems:

(1) y2k proper: the code is broken and time's up.
(2) People's perceptions of and reactions to y2k.
(3) Wising up and cheating because of y2k ("I can't pay you this month").

There are 50,000 mainframes loose out there, 50 billion embedded chips, and 100 million computer systems. Food for thought.

-- George (jvilches@sminter.com.ar), July 09, 1999.


George, I would add sabotage to your list of variables.

-- Brooks (brooksbie@hotmail.com), July 09, 1999.

After 33 years in the field, 7-10 failures per week in a medium/large shop seems almost conservative. Some (most) of these are repaired quickly; a few take much longer. Most organizations (I've worked with Federal, State, and local government, and with private industry) have a substantial degree of fault tolerance...until the payroll falters! So companies can bumble along with their usual 7-10 failures per week. The question becomes: what happens when the number reaches 100-200 per week? At some point, the disruption exceeds the fault tolerance level. That is what we are trying both to avoid (through remediation) and to prepare for (through contingency planning). If the disruptions become extreme, we will need extreme contingency measures...like a substantial amount of water, rice, beans, SPAM, cola, twinkies, and coffee stored. (Any company that doesn't recognize that programmers work mostly on twinkies and coffee could be in real trouble.)
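
A back-of-the-envelope sketch of where that tolerance runs out, in Python with invented numbers: a shop that can clear 20 problems a week never builds a queue at 10 new ones a week, but at 150 a week the backlog compounds until something (payroll, usually) falters:

    # Toy backlog model; all the rates are made up for illustration only.
    def backlog_after(weeks, new_per_week, fixed_per_week):
        backlog = 0
        for _ in range(weeks):
            backlog = max(0, backlog + new_per_week - fixed_per_week)
        return backlog

    print(backlog_after(8, new_per_week=10, fixed_per_week=20))   # 0 -- business as usual
    print(backlog_after(8, new_per_week=150, fixed_per_week=20))  # 1040 unresolved failures after two months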

-- Mad Monk (madmonk@hawaiian.net), July 09, 1999.

Wow! Great challenge. Indeed, if fix-on-failure will work as speedily as the pollys claim, WHY NOT just take the next upcoming 3-day weekend (Labor Day) and declare it National Y2K Fix-it Weekend? Heck, it might actually be fun, turning the clocks forward on everything ... companies could encourage families to come in and watch as all the little pesky, minor Y2K problems get cleaned up by lunchtime Saturday. Sunday for testing. Monday for testing with everyone else. What a wonderful, fun-filled weekend that would be!

-- King of Spain (madrid@aol.com), July 09, 1999.


Brooks, good point. There are bound to be some "disgruntled employees" who have left or will leave little timebombs in computer systems. Look at how "going postal" has made its way into the vernacular--how many sub-postal folks are out there, not willing to spray bullets in the workplace but willing to spray bad code in the systems?

-- Old Git (anon@spamproblems.com), July 09, 1999.

Cory hit the description dead-on about life in a mainframe shop in the wee hours of the morning. I know...22 yrs in the Production Control trenches. I work at a credit union now - when the "usual" abends occur, it's almost always due to either bad input that got past the edits or "something got changed"...be it an LRECL, a copybook, a new program compile that did not quite cover all possible error scenarios, etc., etc.
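
For anyone who hasn't lived it, here's roughly what "bad input past the edits" or a changed record layout does, sketched in Python with an invented layout rather than the COBOL copybook where it really happens:

    # Fixed-column record layout the program was compiled against.
    OLD_LAYOUT = {"acct": (0, 10), "amount": (10, 19), "date": (19, 25)}

    # Upstream change: the AMOUNT columns now carry a status word instead.
    record = "0012345678  PENDING990831"

    start, end = OLD_LAYOUT["amount"]
    try:
        amount = int(record[start:end])
    except ValueError as err:
        # The Python stand-in for a S0C7 data exception at 2 AM.
        print("abend:", err)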

Life gets REAL exciting when the first few abends hit your screen. You can fix some, but gotta call for support from programmers (who LOVE those 2 am phone calls). What? No dial-up capability? OK - see ya in an hour or so onsite. OK - Problems 1 thru 4 fixed. Do the recompiles. Restore from backups. Oops...tape read error. No problemo, pull the other backup. Whaddya mean, it's "offsite"? Heavy sigh - FedEx, here we come. OK - 2 hrs has gone by with JCL fixes, recompiles and program merges. Programmers now onsite with REAMS of listings, flipping thru looking for that odd address space listed in the SOC7 134K-line printout. Now the fun begins - Ops calling asking for status...Data Control wants to know when are they gonna get reports to balance? Network Ops calling saying folks are getting pissed - cannot access accounts on their ATMs. Managers, who you called 3 hrs ago to tell them "Hey, guess what?", are now calling asking for status...

You look up at the clock...4-5 hrs gone. You look at your 4 pages of scrawled notes for your nightly log under the now-cold burger with a bite taken out of it. You have 8-12 people now clustered around you frantically doing workarounds on 4 different abends. This is just ONE subsystem with everything else working fine, mind you.

I laugh when I hear companies boast about "Mission Critical" systems being done. Did that include MVS? Halon? Network? The friggen card readers so your people can get into the building? Right - Company ABC has all 37 mission critical systems done. What about the 116 "non- mission critical" systems that provide input files to the mission critical ones? Duh.

Production Control: 8 hrs of boredom punctuated by an hour of sheer terror. Is 10% night differential REALLY worth it...?

-- JCL Jockey (WeThrive@onStress.com), July 10, 1999.


To everybody on the thread:

All too familiar, all too true, all too depressing. :-(

-- Lane Core Jr. (elcore@sgi.net), July 10, 1999.


JCL:

Surely one of the elements of the assessment process was to map the data dependencies between systems? It's been about two years since Arnold Trembley wrote about using this process to identify, as also critical, every system that the critical systems wouldn't run without. Was his shop one of those rare ones where people actually understood the task and knew what they were doing? Or is Cory more nearly correct in saying that everyone who really understood the system has long since been downsized in favor of (much cheaper) blind rulebook followers?
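
The exercise Trembley described boils down to walking the "feeds data into" graph backwards from the declared-critical set; a minimal sketch in Python, with the system names and feeds invented:

    # Which systems must also be treated as critical because something
    # declared critical won't run without their output?
    feeds_into = {                      # upstream system -> systems it sends files to
        "TIMECARD":  ["PAYROLL"],
        "RATE_TBL":  ["PAYROLL", "BILLING"],
        "PAYROLL":   ["GL"],
        "BILLING":   ["GL"],
        "CAFETERIA": [],
    }
    declared_critical = {"PAYROLL", "GL"}

    needed, frontier = set(declared_critical), list(declared_critical)
    while frontier:
        target = frontier.pop()
        for upstream, downstreams in feeds_into.items():
            if target in downstreams and upstream not in needed:
                needed.add(upstream)
                frontier.append(upstream)

    print(sorted(needed))   # everything but the cafeteria system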

-- Flint (flintc@mindspring.com), July 10, 1999.


Flint -

To be brutally honest, it's a combination of factors. On one hand (depending on your organization's mission), you have different definitions of "mission-critical" - is it anything that produces revenue? Or is it anything that satisfies the customer's needs? Credits? Debits? Monthly statements? Mortgage/IRA/Roth accounts? Where do you draw a final line and say (with authority) "These we can absolutely do without"?

I guess the point I was trying to make (and failed miserably, it appears) was that on a "good day" a series of abends can take time and valuable resources (i.e., programmers) away from the "rest of the big picture". Now throw in a major abend or two across multiple subsystems and you have the cliche of 7 juggled balls in the air, all equal priority...which to drop first? Now spread this across a holiday weekend (as Y2K will cover). At "some" point, your first and second-string on-call and onsite folks are gonna get swamped. Only so many hours in a day...ripple effect.

Some folks say "No problem - pack it up and let's go to Sungard". Suuuure...you and 35 other companies. Lessee...how many data centers they got? 8? 12? Do the math. (And pray they have enough DASD and tape drives for ya.)

-- JCL Jockey (WeThrive@onStress.com), July 10, 1999.



JCL:

From your excellent description, it appears there will be some very fancy dancing going on. OK, what process just abended? Do we really need this? What subsequent processes therefore won't be able to run? Can we do without them for now? Is there some quick and dirty way to persuade this process to ignore such errors and hope for the best? How well do the higher mucky mucks understand the consequences of crossing your fingers and slogging onward through the fog? Is there anyone who can accurately prioritize among unacceptable choices, if that's all you've got?

I admit I'm glad I won't be there, although most likely I'll be burning a few NOPs into ROMs to accomplish the same thing. For sure a nerd's life won't be a happy life for a while.

-- Flint (flintc@mindspring.com), July 10, 1999.


Flint,

Your "hope for the best" and your "quick and dirty way" approach emphasizes once again your total lack of perspective (despite your obvious intelligence, what a waste!).

Flint, your post above spells out one thing: TROUBLE. I mean BIG TIME trouble. There are 50,000 mainframes loose out there ready to screw up, 100 million computer systems, and 50 b-b-b-billion embedded microchips. On top of it all, Y2K circumstances (the long weekend, hunky-dory celebrations, 37 million people jammed into Italy for the year 2000 Jubilee organized by the Vatican, terrorists ready to take advantage of the occasion, etc., etc.) will make Y2K's impact far worse.

-- George (jvilches@sminter.com.ar), July 10, 1999.


JCL Jockey has provided one of those rare descriptions of "how it is" that only 1 in 100, even among doomers, really grasp. This is why Cory has the sweats, exactly. And a. And me. I have always said (though I admit it is hyperbole): let all the PCs fail, but we need the 2,000 or so enterprise systems that "run the world" to work or we're cooked.

It will turn out, in the end, that the Achilles heel of all this, even for "compliant companies", is the so-called non-mission-critical systems. They "inter-penetrate" the mission-critical systems at thousands of data choke points. I have never seen even one large enterprise that thoroughly mapped out its systems (heck, or even understood them). This is for the same reason Y2K is a nightmare in the first place: a weird archaeology of OS, languages, versions, patches et al. Until now, it has been far too much trouble to do so BECAUSE YOU COULD RELY ON THE GEEKS TO KEEP THE PLACE RUNNING. Simple business-financial trade-off. But fatal for Y2K.

I'm afraid we're cooked. It's gonna be more than 5 Fortune 500 companies for sure. The problem is, past a certain point it begins to go Milne, and we don't know where that point is.

Not too long now.

-- BigDog (BigDog@duffer.com), July 10, 1999.


BigDog commented:

"It will turn out, in the end, that the achilles heel of all this, even for "compliant companies", are the so-called non-mission critical systems. They "inter-penetrate" the mission-critical systems at thousands of data choke points. I have never seen even one large enterprise that thoroughly mapped out its systems"

BigDog, you're right, and I find this extremely difficult to comprehend. When the powers that be sliced out their Mission Critical systems, was there no thought put into these interfaces? Do you mean to tell me that these folks did not have a handle on the BIG picture? Or was it an impossible task from the beginning?

Ray

-- Ray (ray@totacc.com), July 10, 1999.


I am not saying that "no thought" was given to interfaces. In some cases, I'm sure a great deal was. But consider both the logic and the politics.

Logic: no huge enterprise system is seriously documented. Some pieces are doc'ed in great detail (spec, design, code, test cases); others in some detail; others, not at all. Every org has a different understanding of what "mission-critical" means, and there are multiple orgs within an entity. I doubt (though I could be wrong) that a master "standard" analysis was done within most entities -- the central authority and defining standards didn't/don't exist for doing such a Y2K "requirement". Or, at least, they didn't when folks started. In the BEST cases, we could probably do it now. But it's too late. While various orgs converged within an entity as Y2K remediation progressed, this was largely a "topsy-like" process. BTW, this is why a year of testing was CRITICAL, not optional -- that year would have uncovered the crap with the "non-mission-critical" interfaces. Now, we'll find out much closer to rollover, or after rollover.

Politics: except in the rarest cases, Y2K remediation was/is scut work. This has affected which systems were targeted for remediation (think budgets) from the beginning. Most orgs worked top-down ("these are our 10 most important applications") rather than from a data-centric design architecture. But AGAIN, 90% OF FORTUNE 500 ORGS HAVE NO design architecture for their "systems as a whole".

They literally "don't know what they have" from a data perspective.

So, yes, in one sense, Y2K remediation has always been impossible. IMO, this is one of the chief reasons that it has been denied and/or deferred at so many entities. In another sense, it is as easy as pie: start in on an app and start flailing away at dates. This is the way "real remediation" has been carried on, mostly.

No wonder the mainframe dudes are scared to death. The gap between this and the kind of baloney that is thought to be remediation by the network/pc weenies would be funny if it wasn't so awful.

Moreover, the engineers, by and large, don't "get it" because they have no idea how chaotic these environments are. To repeat myself once again, it's the human beings, geeks/geekettes, who are the real "computers". They "compute" how to keep the rig going. And they do an awesome job (cf. the world economy).

There ain't enuf of them to stick their fingers in the dike. The dikes will collapse to some unknown extent.

By March, 2000, we'll begin to know what we've got. If we're extraordinarily lucky, the geeks will find the pattern in the madness and begin to take charge. If we're extraordinarily unlucky, we've got Infomagic (I don't care how much he is mocked, the scenario is possible). Most likely, we're just gonna have one God-awful mess for five to ten years.

-- BigDog (BigDog@duffer.com), July 10, 1999.



BigDog, thanks for your thoughts. I worked on the biggest mainframe business systems around back in the 60s and 70s. The Cardinal rule back then was DOCUMENTATION. System flowcharts were a must. They showed ALL of the interfaces and were kept current. A job was not implemented without this documentation.

The time involved in keeping the documentation current was a considerable piece of the entire pie, but it was required. I'm sure that some changes never were reflected, and maybe, as time went by and budgets got tight, larger portions of the systems went undocumented. If this is the case today with the Enterprise Systems, I can well understand your concern.

Ray

-- Ray (ray@totacc.com), July 10, 1999.


"The time involved in keeping the documentation current was a considerable piece of the entire pie but it was required."

Exactly. This converges with a's experience of the increasing complexity of software (Capers Jones has built a career on this). Keeping doc up to date has become exponentially more difficult. Automated tools have been long promised, but none have shown up that are worth their salt.

-- BigDog (BigDog@duffer.com), July 11, 1999.

