Safety Moment #52: The Important Few / The Unimportant Many

Fault tree for identifying the important few

It is a truism that all organizations have to operate with limited resources. And this truism applies to process safety as much as it does to any other business activity. Management may declare that safety is the top priority and that they will do “whatever it takes” to achieve safety goals. But the reality is that safety programs can command only so much money and take only so much of the time of senior personnel. Some method of prioritization is needed.

The traditional way of ranking hazards and recommendations has been to use a risk matrix. But, as noted in Safety Moment #51: Limitations of Risk Matrices, such matrices may not be as helpful in deciding on follow-up actions as might be thought. The reason for this is that most of the findings that need to be sorted and ranked wind up in the middle of the matrix with roughly the same risk rank.

Additional means of sorting out priorities are needed. Given that a hazards analysis can generate a very large number of hazards — each of which can have a large number of potential solutions or recommendations — where does a process safety professional start? Which lines of attack will yield the greatest benefit and which approaches are of very little value?

Process safety and risk management professionals spend considerable amounts of time and effort conducting hazards analyses and related activities, such as incident investigations. These activities typically generate a large number of findings and recommendations. Which of these are the “important few” and which are the “unimportant many”?

The Pareto Principle

Vilfredo Pareto

The Pareto Principle, also known as the 80/20 rule, states that, for any particular event or outcome, approximately 80% of the effects come from 20% of the causes. The principle is named after the 19th-century economist Vilfredo Pareto, who observed that 80% of the land in Italy was owned by about 20% of the population.

The principle is observed in business. For example, the following generalizations often hold true.

  • 80% of a company’s sales usually come from 20% of its customers.
  • 80% of a company’s sales are made by 20% of its salesforce.
  • 20% of the workers are involved in 80% of the accidents.
  • 20% of the equipment items cause 80% of the facility shutdowns.
  • 20% of a company’s products will account for 80% of the total product defects.
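Any one of these generalizations can be checked against a facility's own records by ranking the items and measuring the share contributed by the top 20%. The following is a minimal Python sketch of that check. The function name `top_share` and the shutdown counts are made up for illustration; they are not data from any real facility.

```python
def top_share(counts, top_fraction=0.2):
    """Fraction of the total contributed by the largest `top_fraction` of items."""
    ranked = sorted(counts, reverse=True)
    # Always include at least one item, even for very short lists.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical shutdown counts for ten equipment items (illustrative only).
shutdowns = [42, 38, 5, 4, 3, 3, 2, 1, 1, 1]

share = top_share(shutdowns)
print(f"Top 20% of items account for {share:.0%} of shutdowns")
```

With real records the share will rarely be exactly 80%; the point of the check is to see whether a small group of items dominates the total.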

The Pareto Principle is empirical and the 80:20 ratio is not exact; indeed, the principle is sometimes referred to as the 90/10 rule. Mathematically, it can be expressed by the following equation:

log n = c + ( m * log x )

where n is the number of items whose value is greater than x; c and m are constants.
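The equation describes a straight line on log-log axes. As a rough illustration, the Python sketch below evaluates it for a few values of x. It assumes base-10 logarithms (the base is not stated here, and a different base would only change the constants), and the values of c and m are hypothetical. Since n counts the items larger than x, it can only fall as x rises, so m is taken to be negative.

```python
import math

def items_exceeding(x, c, m):
    """Return n, the number of items with value greater than x,
    from log10(n) = c + m * log10(x)."""
    return 10 ** (c + m * math.log10(x))

# Hypothetical constants, chosen only for illustration.
C, M = 3.0, -1.5

for x in (1, 10, 100):
    print(f"x = {x:>3}: n = {items_exceeding(x, C, M):.1f}")
```

Each tenfold increase in x multiplies n by the same factor, 10 raised to the power m, which is why a few large items can dominate the total.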

One commonly held misconception about the Pareto Principle is that 80% of the problems can be resolved with 20% of the resources. In fact, the Principle makes no statement at all as to how much effort is needed to address the contributing factors. This understanding is important when ranking findings and recommendations generated by hazards analyses. An item which has a low ranking can still be addressed if doing so requires very little effort or investment. For example, if a hazard can be addressed by simply writing a short operating procedure or by painting a yellow line, then it is not worth the bother of risk ranking; it is simpler just to take the necessary action.

Finding the Important Few

A method for identifying the “important few” when ranking hazards is fault tree analysis. A detailed example is provided in the book Process Risk and Reliability Management, and in the associated ebook Frequency Analysis. Provided below is a much abbreviated version of this example.

Consider the system shown in the sketch below.

Process flow for fault tree analysis

Liquid is pumped from Tank, T-100, to Vessel, V-100, using Pump, P-101A, which is electrically driven. If the pump should shut down for any reason, P-101B, which is steam driven, and which has 100% capacity, takes over. The flow rate is controlled by FRC-101, whose set point is cascaded from LRC-101.

A hazards analysis team determines that the event “Tank, T-100, overflows” has an unacceptably high risk and that corrective action is needed.

But what action should be taken? Where should the effort be expended? Is it best to work on the instruments? On the operator’s actions? On the pumps? Intuition and common sense do not provide an answer.

Analysis of this system provides the following failure items, along with their predicted frequency or probability of failure.

Item                                          Failure rate or probability
P-101A                                        0.5 per year
P-101B                                        0.1
Instrumentation Plugs (common cause effect)   0.25 per year
Internal Failure (LRC-101)                    0.15 per year
Internal Failure (FRC-101)                    0.13 per year
Internal Failure (Level Alarm)                0.5
Operator Busy Elsewhere                       0.1
Operator Reads Wrong Gauge                    0.01

Some items, such as P-101A, have failure rate values; others — the safeguards such as P-101B — have a failure-on-demand probability.
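The distinction matters when the values are combined. A failure rate multiplied by a failure-on-demand probability gives another rate, whereas multiplying two rates together would not give a meaningful frequency. The short Python sketch below works through the two pumps. It assumes that P-101B is called on every time P-101A stops, as the process description suggests; the variable names are illustrative.

```python
# Values taken from the list above; pairing them through an AND gate
# is this sketch's assumption about how the pumps combine.
p101a_failure_rate = 0.5     # per year: how often the running pump stops
p101b_fail_on_demand = 0.1   # dimensionless: chance the standby pump fails when called on

# Rate x probability = rate: how often both pumps are unavailable together.
loss_of_pumping_rate = p101a_failure_rate * p101b_fail_on_demand

print(f"Loss of pumping: {loss_of_pumping_rate:.2f} per year "
      f"(about once every {1 / loss_of_pumping_rate:.0f} years)")
```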

Analysis of this system produces a fault tree. It is shown below in two parts because the single tree would be too large for one screen or page. The first part of the tree represents the events that could cause the tank to overflow. The second part shows the safeguards that help prevent this event from taking place.

Fault Tree Page 1
Fault Tree Page 2

The tree is broken down into cut sets, which can be used to risk rank the base events. This is done by calculating the risk with the item in the tree and then with it excluded from the tree. (Removing a safeguard from the tree is equivalent to saying that its probability of failure is 1.0.)
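The include-then-exclude calculation can be sketched in a few lines of Python. The base-event values come from the list given earlier, but the gate structure inside `overflow_frequency` is an assumption made for this sketch, because the tree itself appears only as figures. The numbers it prints are therefore illustrative and should not be expected to match the published rankings. All function and dictionary names are hypothetical.

```python
# Base-event values from the article's list.
RATES = {  # initiating events, per year
    "P-101A": 0.5,
    "Instrumentation plugs": 0.25,
    "LRC-101 internal failure": 0.15,
    "FRC-101 internal failure": 0.13,
}
PROBS = {  # safeguards, probability of failure on demand
    "P-101B": 0.1,
    "Level alarm internal failure": 0.5,
    "Operator busy elsewhere": 0.1,
    "Operator reads wrong gauge": 0.01,
}

def overflow_frequency(v):
    """Top-event frequency (per year) for an ASSUMED gate structure.
    This layout is hypothetical, not the article's actual tree."""
    loss_of_pumping = v["P-101A"] * v["P-101B"]              # AND gate
    initiators = (loss_of_pumping + v["Instrumentation plugs"]
                  + v["LRC-101 internal failure"]
                  + v["FRC-101 internal failure"])           # OR gate (sum of rates)
    response_fails = 1 - ((1 - v["Level alarm internal failure"])
                          * (1 - v["Operator busy elsewhere"])
                          * (1 - v["Operator reads wrong gauge"]))  # OR gate
    return initiators * response_fails                       # AND gate

def exclusion_ratios(rates, probs):
    """Top-event frequency with each item excluded, relative to the base case."""
    values = {**rates, **probs}
    base = overflow_frequency(values)
    ratios = {}
    for name in values:
        # An excluded initiator never occurs (0.0);
        # an excluded safeguard always fails (1.0).
        excluded = 0.0 if name in rates else 1.0
        ratios[name] = overflow_frequency({**values, name: excluded}) / base
    return base, ratios

base, ratios = exclusion_ratios(RATES, PROBS)
print(f"Base case: {base:.3f} per year")
for name, ratio in sorted(ratios.items(), key=lambda kv: abs(kv[1] - 1), reverse=True):
    print(f"{name:<30} {ratio:.2f} x base")
```

A ratio below 1 shows how far the overflow frequency would fall if an initiating event were eliminated. A ratio above 1 shows how much a safeguard is currently holding the frequency down. Items whose ratio stays close to 1 are candidates for the “unimportant many”.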

This calculation is carried out for each base event. The results are shown in the following table.

Fault Tree Cut Sets

Some conclusions that can be drawn from this abbreviated analysis are:

  • Instrument pluggage is the most important base event. It constitutes more than a third of the overall risk and should receive the most attention. 
  • Operator issues total only 5%. They are insignificant. Even if they could be made to go away altogether the system risk would be reduced by only 5%. It is not worth spending time and effort on procedures, training or increasing the availability of operators — at least for this particular situation.
  • Once the problems with instrument pluggage are addressed, the next priority is fixing the internal failure problems associated with the low level alarm.
  • The differences in ranking are so great that, even if the basic reliability data are of poor quality (a likely occurrence), the ranking is unlikely to change for any plausible choice of values.

So, in this example, instrumentation problems are the “important few”. All the other items are the “unimportant many”.

Copyright © Ian Sutton. 2018. All Rights Reserved.