Do You Know Your App’s Single Points of Failure?

In my experience, a high availability (HA) analysis helps you identify weaknesses in your IT services. Conduct it regularly, because small changes in your landscape can have an enormous impact on the reliability of your business services. Once you’ve identified a gap between expected and actual availability, I highly recommend carrying out a single point of failure (SPOF) analysis.

Below are some simple steps you can use right away, followed by a complete HA and SPOF analysis example.

An outage of a single sub-service affects the reliability of the entire system.
Therefore, review each of your sub-services with regard to:

  • expected failures
  • preventive measures
  • business impact of an outage

If no preventive measure is available for an expected failure, you’ve identified a single point of failure. Try to eliminate every SPOF as soon as possible to avoid critical outages. Good failure detection and redundancy will help you improve the reliability of your business-critical application.
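The check described above can be sketched in a few lines of Python. This is only an illustration: the sub-service names, failures, measures, and impact ratings are hypothetical placeholders, not the results of a real analysis. Any entry without a preventive measure is flagged as a SPOF.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureMode:
    sub_service: str                   # hypothetical sub-service name
    expected_failure: str              # what can go wrong
    preventive_measure: Optional[str]  # None means no mitigation exists
    business_impact: str               # illustrative rating


def find_spofs(entries: List[FailureMode]) -> List[FailureMode]:
    """Return every entry that has no preventive measure."""
    return [e for e in entries if not e.preventive_measure]


# Illustrative entries only, not taken from a real system
analysis = [
    FailureMode("web server", "process crash", "automatic restart", "medium"),
    FailureMode("database", "hardware failure", None, "high"),
    FailureMode("network", "link failure", "redundant link", "high"),
]

for spof in find_spofs(analysis):
    print(f"SPOF: {spof.sub_service} ({spof.expected_failure}), "
          f"impact: {spof.business_impact}")
```

Keeping the three review criteria as fields of one record makes it easy to extend the list whenever your landscape changes.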

Sample SPOF Analysis

As mentioned above, we use the single point of failure analysis to verify our services in terms of potential errors, mitigations and the business impact of outages.

Our sample application used for this analysis consists of three components. The table below contains the result of this SPOF analysis for our sample application.

[Image: SPOF analysis table for the sample application]

We’ve identified a single point of failure because there is no mitigation available for hardware failures.

I’ve prepared a high availability and single point of failure calculation matrix which you can use right away.

A Guide to Verifying High Availability Requirements

Our service portfolio is growing, and frustrated users can quickly move to more reliable sites. Besides this commercial aspect, downtime can also lead to serious problems with regulatory agencies. Here are some simple steps you can use right away to conduct a high availability analysis. A separate post shows how to identify single point of failure risks in your critical services.

A highly available system continues its business function even if a sub-service crashes. Any single point of failure (SPOF) risk is unacceptable in such a system. Therefore, you should consider redundancy for critical components and fault tolerance for your core system. Usually, a high availability analysis starts with a comparison of expected and actual availability, followed by the identification of single points of failure.

Expected Availability

The expected availability is often part of service specifications or contracts with service providers. Alternatively, you can use this formula to calculate it:

Availability = MTBF / (MTBF + MTTR)

Service Time: Time during which the service is supposed to be available to your users

MTTR: Mean time to repair (in the example below, the total downtime per month)

MTBF: Mean time between failures (Service Time – MTTR)

Let’s assume that users work with our CRM application 21.5 hours a day, 20 days a month, and that the monthly downtime is 4 hours. What is the availability of our CRM application?

Service Time: 25800 minutes per month

MTTR: 240 minutes per month

MTBF: 25800 – 240 = 25560 minutes per month

Availability: 25560 / ( 25560 + 240) = 99.07%
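The calculation above can be reproduced with a short Python sketch. The function name `availability` and its parameters are my own choice for illustration. As in the example, the monthly downtime is used as MTTR.

```python
def availability(service_minutes: float, downtime_minutes: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with MTBF = service time - MTTR."""
    mttr = downtime_minutes
    mtbf = service_minutes - mttr
    return mtbf / (mtbf + mttr)


# 21.5 hours a day, 20 days a month, 4 hours of downtime per month
service_minutes = 21.5 * 20 * 60   # 25800 minutes
downtime_minutes = 4 * 60          # 240 minutes

crm = availability(service_minutes, downtime_minutes)
print(f"{crm:.2%}")  # 99.07%
```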

Actual Availability

Our applications consist of many sub-services such as databases, middleware, load balancers, networks and external service providers. An outage of one load-balanced server may not affect reliability at all, while a crash of a single database will often result in downtime. The former is called availability in parallel and the latter is called availability in series.

Availability in series:   A = Ax × Ay
Availability in parallel: A = 1 – (1 – Ax)²
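As a minimal sketch, both formulas can be written as small Python helpers. The names `series` and `parallel` are my own, and the functions generalize the formulas to any number of components. The parallel formula assumes that the redundant components fail independently of each other. The 99% input is just an illustrative value.

```python
from math import prod
from typing import Iterable


def series(availabilities: Iterable[float]) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """At least one component must be up: 1 minus the chance all are down."""
    return 1 - prod(1 - a for a in availabilities)


# Illustrative value: two components with 99% availability each
print(f"{series([0.99, 0.99]):.4%}")    # 98.0100%
print(f"{parallel([0.99, 0.99]):.4%}")  # 99.9900%
```

Chaining components in series lowers availability below that of the weakest component, while redundancy raises it above that of a single component.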

Typically, the actual availability of a given system will be calculated in the following four steps:

  1. Prepare a block diagram of the system
  2. Develop a reliability model of the system
  3. Calculate the availability of sub-services
  4. Determine availability of the entire system

The block diagram will help you to visualize the essential components. Once you’ve created the visual representation of your services, you derive the reliability model by identifying which components are arranged in series and which in parallel. Finally, you can calculate the availability of your sub-services and of the entire system.

Let’s continue with our sample CRM application, which consists of three sub-services, each with a service time of 21.5 hours a day. Sub-service 1 and Sub-service 3 have an MTTR of 240 minutes per month, and Sub-service 2 has an MTTR of 480 minutes per month.

What is the actual availability of this CRM application?

Step 1: Block Diagram

Our CRM application consists of three sub-services.

[Image: block diagram of the CRM application]

Step 2: Reliability Model

We have no load balancer or service redundancy in place, so it’s an availability in series.

[Image: reliability model of the CRM application]

Step 3: Availability per sub-service

S1: 25560 / ( 25560 + 240) = 99.07%

S2: 25320 / ( 25320 + 480) = 98.14%

S3: 25560 / ( 25560 + 240) = 99.07%

Step 4: System Availability

Availability: 99.07% * 98.14% * 99.07% = 96.32%
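Steps 3 and 4 can be sketched in Python as follows. The variable names are my own. The comparison at the end treats the 99.07% from the Expected Availability section as the target, which is an assumption for illustration.

```python
SERVICE_MINUTES = 21.5 * 20 * 60  # 25800 minutes per month

# Monthly downtime (MTTR) per sub-service in minutes, from the example
downtime = {"S1": 240, "S2": 480, "S3": 240}

# Step 3: MTBF / (MTBF + MTTR), with MTBF = service time - MTTR
per_service = {
    name: (SERVICE_MINUTES - mttr) / SERVICE_MINUTES
    for name, mttr in downtime.items()
}

# Step 4: availability in series is the product of all sub-services
system = 1.0
for a in per_service.values():
    system *= a

for name, a in per_service.items():
    print(f"{name}: {a:.2%}")
print(f"System: {system:.2%}")  # 96.32%

# Gap to the expected availability (assumed target: 99.07%)
expected = (SERVICE_MINUTES - 240) / SERVICE_MINUTES
print(f"Gap: {(expected - system) * 100:.2f} percentage points")  # 2.75
```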

The example above demonstrates how to compare expected with actual availability. Keep in mind that measures such as proactive error detection and redundancy improve availability, while factors such as complexity have an adverse impact on system reliability.
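To illustrate the effect of redundancy, here is a hypothetical what-if calculation in Python. It assumes that Sub-service 2 is run as two independent instances in parallel, which is not part of the sample setup, and it reuses the availability figures from the example.

```python
a1 = a3 = 25560 / 25800  # S1 and S3 from the example
a2 = 25320 / 25800       # S2, the weakest sub-service

# Hypothetical change: two independent instances of S2 in parallel
a2_redundant = 1 - (1 - a2) ** 2

before = a1 * a2 * a3            # all three sub-services in series
after = a1 * a2_redundant * a3   # S2 replaced by the redundant pair

print(f"before: {before:.2%}")  # 96.32%
print(f"after:  {after:.2%}")   # 98.11%
```

Even with S2 made redundant, the system stays below the availability of S1 and S3, because those two remain in series.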

In my next post, I will outline a single point of failure analysis and provide a high availability calculation sheet which you can use right away.