Hamasaki: The three classes of enterprise system failures

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

Attention Hoff, Chicken Little, FactFinder, Decker:

Subject:Re: critique my 'Millenium Y2K' beginning please. A.H. damn it, be kind :)
Date:1999/07/12
Author:cory hamasaki <kiyoinc@ibm.XOUT.net>


On Mon, 12 Jul 1999 08:04:03, Whistlin' Dixie <me@whocares.com> wrote:
 
> But it took 8-10 hours, and it was fun?  LOL  Welcome to the workaday
> world.  Things break, and they get fixed.  Once again, welcome to the
> workaday world.  Sorry if a minor crisis caused you to miss a friggin'
> do-nut and a cup of coffee because you were under stress, but hell,
> welcome to the workaday world.  Glad y'all kept things running.  I do it
> every day.
 
I have no idea what the paragraph above is about.  Big Arnold works for a large financial services IT shop.  It's important because if it fails, commerce stops.  I understand clearly what context Arnold is discussing.
 
But I'm not jumping in because some tin whistle is going -tweeeet-, I'm commenting because we've identified the heart of the problem.  Rather Arnold has.
 
For you newbies, I've never understood the scope of the embedded problem, TELCOs, power, etc.  I've followed a lot of the prattle on the Y2K news and discussion sites, in the trade press, and from the academics.  The chatter from both the pollies and got-its hasn't been convincing.
 
The Y2K problem has always looked to me like an extended enterprise systems failure.  This kind of thing happens very, very rarely.  Here's why.
 
  Why Enterprise Systems Don't Fail.
 
From the dawn of time, large systems were designed to be fail-proof.
 
The hardware, from the data/logic-path level up to the way data centers are configured, was designed that way.  Internal datapaths are error-detecting and error-correcting, and hardware and software are designed so that you can remove and replace parts of functioning systems.  There is a thriving business in off-site data storage and disaster-recovery services: either you go to Sungard, who have a site in Philadelphia, or they'll drive a mainframe to you in a tractor trailer.
 
Some firms maintain twin processing sites, but for all the duplication, redundancy, and big-arnolds on beeper duty, the one thing they don't have is a way of fixing fundamental, structural, designed-in-wrong-from-the-dawn-of-time architectural flaws.
 
  Why Enterprise Systems Will Fail.
 
There are three classes of these errors:
 
1. Basic  yesterday < today logic problems that will fail the first time a 00 date has to compare as later than 99.  This logic may be scattered throughout the source, in too many places, and too obscure to fix in, oh, 2 or 3 hours.
 
In some cases, these problems can be fixed in a few days if the
programmer is familiar with the system.  Persisting anomalies frequently take this long to identify and repair.
 
This is very different from the case of a bug reported during a production run.  Production bugs tend to be isolatable to single data records even though the fix is sometimes to add a few lines of logic to an input edit routine.
 
A class 1 error can put a company out of business but most likely the consequence will be a memorable crisis.  These will be fixed by the second week.
 
2. Data-architectural problems.  Several people have mentioned years in database keys and VSAM indices.  This also shows up on the data side, in large sequential files, etc.  Anywhere a two-digit year appears in data, there is a strong chance that fixing it will require a large effort.
 
For down-to-the-bare-metal types who are doing search-key-less-than-or-equal, this may be an intractable problem.  This specific problem is not the general case, but if it occurs in a key player in a key industry, it is two steps away from taking down the civilized world.
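A minimal sketch of that class 2 failure, again in Python with invented record keys (a real shop would be browsing a VSAM KSDS): once a two-digit year is the leading part of a sort key, records from 2000 sort before 1997, and a key-less-than-or-equal search quietly returns the wrong set.

```python
# Hypothetical records keyed on YYMMDD -- a two-digit year leads the key.
records = {
    "970115": "invoice A",
    "991201": "invoice B",
    "000110": "invoice C",   # January 2000
}

# The year-2000 record collates *first*, not last:
assert sorted(records) == ["000110", "970115", "991201"]

# "Everything up to end of January 2000" should match all three records,
# but a key <= search finds only the 2000 one:
today = "000131"
hits = [key for key in sorted(records) if key <= today]
assert hits == ["000110"]
```

Unlike the class 1 compare, there is no few-line patch here: the fix means expanding or re-encoding the key in every file, index, and program that touches it, which is why this class runs to weeks or months.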
 
Class 2 problems will take weeks to resolve, perhaps months.
 
3. Fundamental structural failures, Data-exchange, and cascading Flub on Failure problems. This is the unknown.
 
State engines and proofs of program correctness depend upon clearly knowing the initial state of the system.  From a known starting state and given the logic, the resulting states are predictable and computable.
 
When the roll occurs, systems will transition to a new, unknown state for the first time.  In some cases, the problem will be that part of the system is made of remediated code operating on expanded data while another part is unremediated. 
 
Company to company data exchange fits this category.
 
As problems are identified and fixes made in oh, 2 or 3 hours, the fixes will be put into production without QA, without review, in the middle of the night, with Ko-skin-em types screaming in terror, their big-brains awash with red mindless fear, that the system *must* be fixed.
 
This will introduce subtle new problems and the cycle will begin again.
 
More fixes, more screaming, the clock will run.
 
Class 3 problems will take months to resolve; some may never be repaired and will take down even the most resilient enterprise.  A class 3 problem in the right place can take down the civilized world.
 
(Note to H.B. types and other writer wannabes dragging c.s.y2k for material, consider this copyrighted.  Flames here, email reproduction, and reposts to other forums for additional critiques are encouraged under fair use as long as proper cites are maintained.)
 
 
> Arnold Trembley wrote:
> >
> > Read what Cory Hamasaki said, and what James Johnson said.
> >
> > I get the primary beeper about one week in eight.  Last time I had it, I
> > wasn't beeped a single time.  That was considered somewhat unusual.
> >
> > Given that we run thousands of jobs a day, and 90+ CICS regions, I'm
> > certain we have more than 7-10 "problems" per day.  This kind of metric
> > depends somewhat on how many jobs you run in a day.
> >
> > Much of the time I am beeped just to advise me of an event the operators
> > or production controllers can fix on their own.  Occasionally, I am
> > beeped to request permission to take a specific action to recover from a
> > particular problem - "we have a task suspended for more than five
> > minutes in region XXXX", so I tell them to purge it.  That doesn't seem
> > to happen anymore, since we upgraded our third party database.
> >
> > Sometimes there's a more difficult problem, jobs run out of sequence,
> > schedule conflicts because one thread was delayed, ran several hours
> > late, and collided with a different job schedule that had
> > interdependencies.  Sometimes a tape goes bad and can't be read, or the
> > heads get dirty and the drive won't read.  Or a database has locked up
> > and a job goes down.
> >
> > If I have to come in at 3:00 AM, generally the most I'll have to do is
> > write some JCL, dump a file, figure out what's wrong, read a manual,
> > restore a file, page the database group, or assist the production
> > controllers in rerunning a complicated schedule.
> >
> > Rarely, I may have to fix a data file.  Maybe once a year I'll have to
> > fix a program in the middle of the night, and usually I get that call on
> > the secondary beeper, because the odds are only 1 in 8 that I will be on
> > the primary that night.
> >
> > The trickiest part is notifying upper management if there is customer
> > impact or the problem is unresolved after one hour.
> >
> > We don't get storage violations on our CICS regions.  Last February a
> > region crashed with a program check in the CICS nucleus.  Nobody ever
> > figured that one out, but the region was restarted before the analysis
> > began.
 
I'm assuming that the nucleus stayed up. Do they still call it the "Control Region"?  Sidebar, there are known ways for applications to take down supposedly bulletproof operating systems.  I am certain that it is possible for a badly behaved CICS transaction to bring down the entire subsystem.
 
> > The point is, these problems happen all the time, and the jobs are
> > designed to be restartable and recoverable.  But it's not very likely
> > that a date rollover problem will fall into this category.  The leap
> > year bug in 1992 took a week to correct, and the workaround for the
> > failing job took 8-10 hours.
> >
> > We're going to have some fun.
 
  Why This is Bad News.
 
Yes, and here, big arnold, who might be a mainframe polly, has
identified the crux of the matter.
 
For best-of-the-best enterprises, Y2K will be their worst nightmare: a combination of decade rollover, leap year, and the first year-end with the major new system.
 
This is the typical IT mess, jobs failing, chaos, but by gum, we're IT professionals, we'll eat donuts, kill trees, and get this jobstream running and the big boss will pat us on the head and remember what we did for, oh, 2 or 3 months.
 
For other than the best of the best, you gotta ask yourself, will it be Class 1, Class 2, or Class 3?
 
Class 3, where have I heard that before?
 
> > --
> > Arnold Trembley
> > http://home.att.net/~arnold.trembley/
> > "Y2K?  Because Centuries Happen!"
 
We were out of time at Day 500 for a good outcome.  Since then, the blather from the Broomies has gotten stranger and less entertaining.
 
I'm still hoping that it won't go Milne or InfoMagic but each day that I see more Broomie Baloney, I am less certain that there will be a good outcome.
 
With each new Polly Prattle Piece, my confidence in our ability to recognize and fix problems is shaken.
 
How many DGI's are there who think they're thinking but are only firing a few stray wistful lonely synapses, their brains kicking like a galvanic frog's leg?
 
  What Can You Do About It?
 
I sure don't know.  Tuna is on sale again.  Clorox is cheap.  You can invest a few hundred dollars and spend some time fixing up your great aunt Louise's farm.
 
You should be fixing up your aunt's farm anyway.  You really don't need to go on that cruise or watch another stupid NASCAR race when you could bushhog her back lot, teach your kids about nature, sip lemonade on the farmhouse porch and look at the family photos.
 
Saaaay, uncle bob sure doesn't look like anyone else in the family, in fact, he looks like the appliance repairman....
 
cory hamasaki http://www.kiyoinc.com/current.html




-- a (a@a.a), July 12, 1999

Answers

I try to work at the farm every weekend, but in the middle of the hot afternoon, it's okay to take a break and watch another stupid NASCAR race!

-- Dog Gone (layinglow@rollover.now), July 12, 1999.

Why do you people listen to anything Cory HaveADonut says?

He blew the Joe Anne Effect thing. Why are you listening to him now?

Cory Haveadonut runs a software store in Annapolis Maryland.

-- A-Doobie (doobie@doo.net), July 13, 1999.

