I have a question that I hope someone can provide some insight into. When I establish the SL uptime number for a particular system capability in an SLA, I typically include a fault tolerance analysis that identifies a single point of failure in the system. Based on that SPOF, we define the SLA uptime the system is "designed" to achieve and make that our number.

For example, say we have a single database server and the worst-case scenario is a disk failure that could take 8 hours to recover from (roughly 99% availability over a monthly timeframe). How does one take probability into account? For instance, the risk above is present, but this particular DB server has multiple disks, so one could argue that the likelihood of the failure occurring is minimal and thus it shouldn't determine the SL the system is "designed" to achieve.
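
To make the arithmetic concrete, here is a rough sketch of the two numbers in question: the worst-case availability if the 8-hour recovery actually happens in a month, and a probability-weighted figure. The 30-day month, the function names, and especially the 2% monthly failure probability are assumptions for illustration only, not measured values, and the model assumes at most one such failure per month.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month (720 hours)


def availability(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_hours / period_hours


def expected_availability(p_failure, recovery_hours, period_hours=HOURS_PER_MONTH):
    """Availability weighted by the chance that the failure occurs in the period.

    Simplification: at most one failure per period, so the expected
    downtime is p_failure * recovery_hours.
    """
    return availability(p_failure * recovery_hours, period_hours)


# Worst case: the disk failure happens and recovery takes the full 8 hours.
worst_case = availability(8)

# Probability-weighted: 0.02 is a made-up monthly failure probability.
expected = expected_availability(0.02, 8)

print(f"worst case: {worst_case:.4%}")
print(f"expected:   {expected:.4%}")
```

The gap between the two figures is the crux of the question: the worst case describes a bad month, while the weighted figure describes the long-run average under the assumed probability.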

Please help with some insight, probability equations, or whatever else helps an SL manager come to an appropriate SL uptime number.

Regards, Steve

January 27, 2003


Steve, It sounds to me like you are making this too difficult. Instead of failure point analysis, probability equations, etc., why don't you simply analyze the availability (SL uptime number) that you have been delivering?
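
As a rough illustration of that approach, delivered availability can be computed directly from an outage log. The outage durations and the 90-day window below are hypothetical placeholders; real figures would come from your own incident records.

```python
def measured_availability(outage_hours, period_hours):
    """Delivered availability: 1 minus total outage time over the period."""
    return 1.0 - sum(outage_hours) / period_hours


# Hypothetical outage durations in hours over a 90-day window.
outages = [1.5, 0.25, 3.0]
period = 90 * 24  # 2160 hours

delivered = measured_availability(outages, period)
print(f"delivered availability: {delivered:.3%}")
```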


Rick Sturm

February 25, 2003

