
Alarm priority design: cutting the shift alarm flood

Alarm priority design for SA control engineers — deadband, on-delay, latching, and a Pareto trim taking a 600-an-hour flood to a 20-an-hour shift list.

Alarm priority design is the part of control work that almost nobody on a brownfield SA plant has had time to do properly, and almost everyone pays for it on the night shift. The pathology is the same on every site: a control room with three operator stations, a 24-inch alarm summary banner that scrolls past too fast to read, and a printout from the historian showing 437 alarms in the last hour during a routine batch changeover. The operator silenced 412 of them without reading. Among the other 25 was the one alarm that actually mattered: a primary cooling-water flow trip that nobody saw until the temperature had risen 14 degrees over setpoint. The maths of alarm flooding is unforgiving and the fix is not more screens. The fix is fewer alarms, with priorities that the operator trusts. This tutorial walks through the rules.


Why this matters on real plants

ANSI/ISA-18.2, the reference standard for alarm management (IEC 62682 is its international counterpart), gives a target alarm rate of about 6 alarms per hour per operator during steady-state operation, and treats more than 10 alarms in 10 minutes as a flood. We have walked into petrochem, mining, and FMCG sites where the steady-state rate was 60 to 200 per hour and where every shift handover started with the outgoing operator saying "ignore everything in red, it's been red for two weeks". That is not an alarm system. That is wallpaper. When the screen looks the same whether the plant is healthy or on fire, the operator stops looking, and the next real fault gets answered after the relief valve has already lifted.
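Both figures can be checked against a historian export. The following is a minimal Python sketch, not PLC or historian code. It assumes the export has already been reduced to a list of alarm-raise times in seconds, and it treats a flood as more than `threshold` alarms inside a trailing 10-minute window. The function names and the sample data are illustrative only.

```python
from collections import deque

def hourly_rate(alarm_times_s, span_s):
    """Average alarm rate over the observed span, scaled to alarms per hour."""
    return len(alarm_times_s) * 3600.0 / span_s

def flood_times(alarm_times_s, window_s=600, threshold=10):
    """Return the raise times at which the trailing window held more than
    `threshold` alarms (assumed flood definition; check it against your copy
    of the standard and your site alarm philosophy)."""
    window = deque()
    flagged = []
    for t in sorted(alarm_times_s):
        window.append(t)
        # Drop alarms that have aged out of the trailing window.
        while window[0] <= t - window_s:
            window.popleft()
        if len(window) > threshold:
            flagged.append(t)
    return flagged

# Illustrative data: one alarm every 10 minutes for an hour, then a 12-alarm burst.
steady = [i * 600 for i in range(6)]
burst = [5000 + i for i in range(12)]

print(hourly_rate(steady, 3600))    # 6.0
print(flood_times(steady + burst))  # [5010, 5011]
```

The burst is flagged from its eleventh alarm onward, which is the point at which the trailing window first holds more than ten.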

The cost of getting alarm priority wrong scales with how busy the operator already is. On a quiet plant on a Sunday afternoon at 14:00 with one batch in progress, even a flood is survivable — the operator has time to read every line. On the same plant at 02:00 during load-shedding stage 4 with a diesel kicking in, three batches stacked, an HMI lag from the network ring being briefly partitioned, and a bad weather front running through the area, an extra ten alarms a minute is the difference between a controlled shutdown and a recordable incident. Alarm priority design is not a nice-to-have. It is the operational discipline that makes the night shift survivable.

The third reason it matters more in SA than in textbook examples: instrumentation maintenance budgets are tight, and many transmitters that should have been replaced years ago are still running with degraded signal-to-noise. A pressure transmitter with a 2% noise band on a 4-20 mA loop can trip a high-pressure alarm seven times an hour just from electrical noise, and an operator who has watched that alarm cry wolf for a month will silence it before reading the value the eighth time, when it is real. Deadband and on-delay are the cheap fix. They cost nothing in capex and they pay back the first night they keep an operator from chasing a phantom.

The mental model

Every alarm that reaches an operator passes through the same four-stage pipeline. Get one stage wrong and the alarm is either too noisy or too quiet, and either failure mode degrades the operator's trust in the whole system.

The first stage is detection — comparing a process value against a setpoint. The naive form is a single comparator: trip the alarm when level > 4500.0. The mistake is using a bare comparator on a noise-prone signal. A better detection is a comparator with deadband — trip on level > 4500.0, clear on level < 4450.0. The 50 mm gap between trip and clear stops the alarm chattering when the level is hovering on the threshold.
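The difference between the two comparators is easy to see in a small simulation. This is a hedged Python sketch rather than PLC code: the class name, the helper and the sample values are made up for illustration, and only the 4500.0 trip and 4450.0 clear thresholds come from the text.

```python
class DeadbandComparator:
    """High alarm with hysteresis: trips above `trip`, clears below `clear`."""

    def __init__(self, trip, clear):
        if clear >= trip:
            raise ValueError("clear threshold must sit below the trip threshold")
        self.trip = trip
        self.clear = clear
        self.state = False

    def update(self, pv):
        if pv > self.trip:
            self.state = True
        elif pv < self.clear:
            self.state = False
        # Between the two thresholds the previous state is held.
        return self.state

def count_raises(states):
    """Count False-to-True transitions, i.e. how often the alarm re-annunciates."""
    raises, previous = 0, False
    for s in states:
        if s and not previous:
            raises += 1
        previous = s
    return raises

# Illustrative level samples hovering around the 4500.0 threshold.
samples = [4490.0, 4505.0, 4495.0, 4510.0, 4498.0, 4502.0, 4440.0]

bare = [pv > 4500.0 for pv in samples]
comparator = DeadbandComparator(trip=4500.0, clear=4450.0)
deadbanded = [comparator.update(pv) for pv in samples]

print(count_raises(bare))        # 3
print(count_raises(deadbanded))  # 1
```

The bare comparator raises the alarm three times on the same hovering signal. The deadbanded one raises it once and clears only when the level genuinely falls away.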

The second stage is persistence — requiring the trip condition to hold for a minimum time before the alarm is raised. This is the on-delay filter: a TON timer with a preset of two to ten seconds, fed by the deadbanded comparator. Spikes shorter than the preset are filtered out. A 50 ms spike on a noisy 4-20 mA loop never triggers an alarm. A genuine high-level event that holds for 30 seconds always does.
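A TON-style filter can be sketched in a few lines of Python to show why the spike disappears and the sustained event does not. The 50 ms scan time, the 3-second preset and the class name are assumptions chosen for the illustration, and a real PLC timer is driven by the controller clock rather than by a scan count.

```python
class OnDelay:
    """TON-style on-delay: output goes true once the input has been
    continuously true for `preset_ms`. Time is counted in whole milliseconds
    so the comparison is exact."""

    def __init__(self, preset_ms):
        self.preset_ms = preset_ms
        self.elapsed_ms = 0

    def update(self, condition, scan_ms):
        if condition:
            self.elapsed_ms = min(self.elapsed_ms + scan_ms, self.preset_ms)
        else:
            self.elapsed_ms = 0  # any dropout restarts the timer
        return self.elapsed_ms >= self.preset_ms

SCAN_MS = 50  # assumed scan time

def run(conditions, preset_ms=3000):
    """Feed a sequence of per-scan conditions through a fresh timer."""
    timer = OnDelay(preset_ms)
    return [timer.update(c, SCAN_MS) for c in conditions]

spike = [False, True, False, False]  # a single 50 ms spike
sustained = [True] * 600             # condition held for 30 s

print(any(run(spike)))             # False
print(run(sustained).index(True))  # 59
```

The sustained event is annunciated on the 60th scan, three seconds after it started, while the spike never reaches the output.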

The third stage is latching and acknowledgement — once raised, the alarm holds in the active list until the operator acknowledges it, even if the underlying condition clears. This is what stops a self-clearing transient from disappearing before the operator sees it. The latch is a standard --(L)-- coil set when the persistence filter trips, and reset only when both the condition has cleared and the operator has acknowledged.
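The latch rule reads naturally as a small state machine. The sketch below is a Python illustration under the assumption that `raised` is the output of the persistence filter and `ack_pressed` is the operator's ACK input on the current scan. The class and argument names are hypothetical.

```python
class AlarmLatch:
    """Latch that releases only when the condition has cleared AND the
    operator has acknowledged."""

    def __init__(self):
        self.latched = False
        self.acked = False

    def update(self, raised, ack_pressed):
        if raised and not self.latched:
            # New alarm: latch it and wait for a fresh acknowledgement.
            self.latched = True
            self.acked = False
        if self.latched and ack_pressed:
            self.acked = True
        if self.latched and self.acked and not raised:
            # Both release conditions met: drop off the active list.
            self.latched = False
            self.acked = False
        return self.latched

# A transient that clears by itself stays on the list until acknowledged.
transient = AlarmLatch()
print([
    transient.update(True, False),   # condition raised
    transient.update(False, False),  # condition cleared, no ACK yet
    transient.update(False, True),   # operator acknowledges
])  # [True, True, False]
```

Acknowledging while the condition is still active has the mirror-image effect: the row stays in the list until the condition itself clears.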

The fourth stage is prioritisation — every active alarm carries a priority field (typically 1 to 4, where 1 is "critical / safety" and 4 is "diagnostic / informational") that the HMI uses to sort the alarm summary, colour-code the row, and decide whether to push to a beeper or a voice annunciator. ISA 18.2 recommends a Pareto distribution: roughly 5% critical, 15% high, 80% medium-and-below. A plant where 60% of alarms are critical has no priorities at all — every alarm is equal, which means no alarm is special, which means the operator triages by recency rather than importance.
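A quick audit of the configured priorities shows whether a site is anywhere near that shape. This is a minimal Python sketch. The target shares are the figures quoted in the paragraph above, the function names are illustrative, and the two priority lists are invented data. It assumes a non-empty list of priorities exported from the alarm database.

```python
from collections import Counter

# Target shares as quoted in the text; priorities 3 and 4 together take the rest.
TARGET_SHARE = {1: 0.05, 2: 0.15}

def priority_shares(priorities):
    """Fraction of configured alarms at each priority level 1..4."""
    counts = Counter(priorities)
    total = len(priorities)
    return {p: counts[p] / total for p in (1, 2, 3, 4)}

def over_target(shares):
    """Priority levels whose share exceeds the quoted target."""
    return [p for p, target in TARGET_SHARE.items() if shares[p] > target]

# Invented example data: 200 configured alarms before and after rationalisation.
skewed = [1] * 120 + [2] * 20 + [3] * 50 + [4] * 10
rationalised = [1] * 10 + [2] * 30 + [3] * 140 + [4] * 20

print(priority_shares(skewed)[1])                  # 0.6
print(over_target(priority_shares(skewed)))        # [1]
print(over_target(priority_shares(rationalised)))  # []
```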

Worked example

Open the simulator. Drop a CompactLogix CPU on the rack and load the latch ladder preset. We are going to wire six alarms — one critical, two high, three medium — into the alarm subsystem and watch the priority field do its work.

The six alarms:

AL_001 Primary cooling water flow low — critical (priority 1, SIL-rated trip)
AL_002 Reactor pressure high — high (priority 2)
AL_003 Feed-tank level high — high (priority 2)
AL_004 Bearing temperature warning — medium (priority 3)
AL_005 HMI heartbeat lost on Station 2 — medium (priority 3)
AL_006 Conveyor speed below setpoint — medium (priority 3)

Each alarm is a structure in the PLC tag table:

TYPE AlarmRec :
STRUCT
    Active     : BOOL;          // condition currently true
    Latched    : BOOL;          // raised, awaiting operator action
    Acked      : BOOL;          // operator has pressed ACK
    Priority   : INT;           // 1 = critical, 2 = high, 3 = medium, 4 = low
    Deadband   : REAL;          // trip-to-clear gap, in engineering units
    OnDelay    : TIME;          // on-delay preset, T#3s typical
    FirstOut   : BOOL;          // first alarm in a cascade
    Timestamp  : DT;            // raised-at, set on transition
END_STRUCT
END_TYPE

The pseudo-ladder for one alarm latch with priority and on-delay looks like this:

   |                                                                      |
   |  PV, SP_high, SP_low                                AL.Active        |
   |--[dbHyst]----------------------------------------------( )-----------|
   |                                                                      |
   |  AL.Active                                          AL.Latched       |
   |--| |--[TON PT=AL.OnDelay]------------------------------(L)-----------|
   |                                                                      |
   |  AL.Latched   AL.Acked   AL.Active                  AL.Latched       |
   |--| |-----------| |--------|/|--------------------------(U)-----------|
   |                                                                      |
   |  AL.Latched                                         AL.Priority      |
   |--| |------------------------------------------------[MOV 1]----------|
   |                                                                      |

Read it as four rungs. Rung one is the deadbanded detection. The hysteresis block (drawn here as dbHyst) is the standard SR pattern: set when PV > SP_high, reset when PV < SP_low, and the output holds its last state between the two thresholds. Its result is written to AL.Active. Rung two runs AL.Active through a TON on-delay timer whose preset is AL.OnDelay, and latches the alarm record once the timer times out. Rung three unlatches the alarm only when the operator has acknowledged AND the underlying condition has cleared; that combination is the rule that prevents an alarm from clearing before the operator notices it. Rung four moves the priority constant into the alarm record so the HMI sorter has something to read.

Now run the scenario. The simulator's latch preset has a slider feeding the PV input and a button bank for the six alarm conditions. Push the cooling-water flow-low button. The PV drops below the SP_low threshold. The deadband holds the comparator true. The on-delay TON starts counting. Three seconds later the latch rung fires. AL_001 appears in the alarm summary, sorted to the top because Priority = 1, with a red banner and a beeper. Push the bearing-temp warning button at the same time. AL_004 appears below AL_001, sorted by priority then timestamp, with a yellow banner and no beeper. Press the operator's ACK button on AL_004 — the row goes from blinking to steady, the timestamp updates with the ack-time, and the row remains in the list because the underlying condition is still active. Release the bearing-temp button — now both conditions are met (cleared + acked) and AL_004 drops off the list. AL_001 stays. The priority sort is doing its job: the critical alarm is at the top, the noise is at the bottom, and the operator's eye lands where it should.
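The sort the HMI applies in this scenario is a two-key ordering. The Python sketch below is an illustration only: it assumes rows with equal priority are listed oldest first, which the simulator may or may not do, and the `AlarmRow` type and the raise times are invented.

```python
from dataclasses import dataclass

@dataclass
class AlarmRow:
    tag: str
    priority: int       # 1 = critical ... 4 = low
    raised_at_s: float  # raise time in seconds (illustrative)

def summary_order(rows):
    """Sort by priority first, then by raise time (oldest first, assumed)."""
    return sorted(rows, key=lambda r: (r.priority, r.raised_at_s))

rows = [
    AlarmRow("AL_004", 3, 10.0),
    AlarmRow("AL_006", 3, 4.0),
    AlarmRow("AL_002", 2, 12.0),
    AlarmRow("AL_001", 1, 10.0),
]

print([r.tag for r in summary_order(rows)])
# ['AL_001', 'AL_002', 'AL_006', 'AL_004']
```

AL_002 was raised last but still sorts above both medium rows, which is the behaviour the operator relies on.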

The Pareto effect shows up the moment you bring more medium alarms into the list. Raise AL_004 again and add AL_005 and AL_006, two more medium-priority records. The summary is now four rows. Sort by priority and the critical one stays at the top, but the three mediums pile up beneath it. On a real plant, the top 5% of alarm sources typically cause about 80% of the alarms, usually two or three chattering signals on degraded transmitters. Identify those two or three sources, fix the deadband on each, and the medium pile drops from three to one. That is the trim that takes a 600-an-hour flood down to a 20-an-hour shift list.
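Finding the handful of sources that produce most of the load is a counting exercise on the historian log. The sketch below is a hedged Python illustration: the log is invented, the function name is hypothetical, and the 80% cut-off is simply the figure used in the text.

```python
from collections import Counter

def bad_actors(alarm_log, share=80):
    """Return the most frequent tags, in descending order, until together
    they account for at least `share` percent of all logged alarms."""
    counts = Counter(alarm_log)
    total = len(alarm_log)
    selected, running = [], 0
    for tag, n in counts.most_common():
        # Integer arithmetic avoids rounding trouble at the cut-off.
        if running * 100 >= share * total:
            break
        selected.append((tag, n))
        running += n
    return selected

# Invented log of 100 alarm raises, one tag per raise.
log = (["AL_004"] * 50 + ["AL_006"] * 30 + ["AL_002"] * 10
       + ["AL_003"] * 6 + ["AL_001"] * 2 + ["AL_005"] * 2)

print(bad_actors(log))  # [('AL_004', 50), ('AL_006', 30)]
```

In this made-up log two tags out of six account for 80 of the 100 raises, so those two are where the deadband and on-delay work should start.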

Common mistakes

No deadband on noise-prone signals. A 4-20 mA loop with a 2% noise band on the analog card will chatter a bare-comparator alarm 60 to 100 times an hour at threshold. The operator sees the same alarm repeating, silences it once, and stops reading the alarm name. Always set deadband to 1.5x the measured noise band of the loop. On the simulator the analog input has a noise-injection slider — set it to 2% and watch the alarm chatter; add the deadband and watch it stop.
Latch that clears without operator acknowledgement. A common rookie pattern is to wire the latch reset directly to the underlying condition going false, without the ACK gate. The alarm clears the moment the condition clears, and a 200 ms transient is invisible to the operator who looked away for two seconds. Always require BOTH the cleared condition AND the operator ACK before the latch resets. The four-rung pattern above does this: the unlatch rung has Acked AND NOT Active in series.
No operator-visible priority field in the alarm summary. A plant where every alarm is the same colour, the same size, and listed in arrival order is a plant where priority is implicit in the operator's head and disappears the moment a new operator takes over. The priority field on the alarm record must drive the sort order and the colour code on the HMI — that is the only place the priority does its work.
No first-out indication on cascading alarms. When a primary fault triggers three downstream fault chains, the operator wants to know which alarm tripped first: the root cause, not the three consequences. The FirstOut bit on the alarm record is set true on the first alarm raised in a 30-second window and held false on every subsequent one, so the alarm summary shows a single root-cause flag. Without it, a cooling-water trip cascading into pump-trip / reactor-temp-high / batch-fault looks like four independent alarms and the operator chases the wrong one.
Priority distribution skewed to critical. A plant with 60% of alarms marked critical has no priorities at all — every row is red, the beeper never stops, and the operator silences alarms by the row rather than by reading them. Audit the priority distribution on the historian once a quarter and force a Pareto fit. ISA 18.2 recommends roughly 5% / 15% / 80% across critical / high / medium-and-below. If you are above 10% critical you have miscategorised somewhere.
No suppression on planned shutdowns or maintenance. A common operational source of alarm flood is a planned event — a unit going into recipe changeover, a line being washed, a section being isolated for maintenance — that triggers thirty incidental alarms because the suppression masks were never built. The alarm subsystem must support per-record suppression with a timestamp and an audit trail. Without it, every recipe change is a flood.
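The suppression requirement in the last item can be sketched as a small register that records who suppressed what, why, and until when. This is an illustrative Python model, not any vendor's API. The time-limited expiry is an added assumption (the text only asks for a timestamp and an audit trail), and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Suppression:
    tag: str
    reason: str
    by: str
    start_s: int
    end_s: int

class SuppressionRegister:
    """Per-record suppression with an append-only audit trail."""

    def __init__(self):
        self.active = {}
        self.audit = []

    def suppress(self, tag, reason, by, now_s, duration_s):
        entry = Suppression(tag, reason, by, now_s, now_s + duration_s)
        self.active[tag] = entry
        self.audit.append(("SUPPRESS", now_s, tag, by, reason))

    def is_suppressed(self, tag, now_s):
        entry = self.active.get(tag)
        if entry is None:
            return False
        if now_s >= entry.end_s:
            # Assumed policy: suppression expires on its own and is logged.
            del self.active[tag]
            self.audit.append(("EXPIRE", entry.end_s, tag, entry.by, entry.reason))
            return False
        return True

register = SuppressionRegister()
register.suppress("AL_006", "line wash", "operator_2", now_s=0, duration_s=1800)

print(register.is_suppressed("AL_006", 600))   # True
print(register.is_suppressed("AL_001", 600))   # False
print(register.is_suppressed("AL_006", 1800))  # False
print(len(register.audit))                     # 2
```

Forcing every suppression to carry a reason, an owner and an end time is a design choice that keeps a maintenance mask from quietly becoming permanent.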

How to practise this in the simulator

The simulator's latch preset is built for this exact pattern. Open it, add the six alarms above as records in the alarm panel, and run the scenario. Push the buttons in different orders, vary the on-delay times from 100 ms (chatter) to 30 s (sluggish), turn the deadband off and back on, and watch the alarm rate counter climb and fall. Then inject a fault that cascades into three downstream alarms (the simulator's fault-injection panel has a "cascade" toggle that fires AL_002, AL_003, AL_004 ten milliseconds after AL_001) and watch the FirstOut indicator pick out the root cause. Twenty minutes of breaking and fixing teaches more about alarm priority design than reading the standard cover-to-cover.
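The first-out rule described earlier (the first alarm raised in a 30-second window carries the flag) can be sketched as follows. This Python version is an illustration under one stated assumption: a new window opens when an alarm arrives more than 30 seconds after the current first-out alarm. The simulator's own windowing may differ, and the event list is invented to mirror the cascade toggle.

```python
def mark_first_out(events, window_s=30.0):
    """Given (tag, raised_at_s) pairs, return (tag, is_first_out) pairs in
    time order. Only the alarm that opens a window is flagged."""
    flagged = []
    window_start = None
    for tag, t in sorted(events, key=lambda e: e[1]):
        is_first = window_start is None or t - window_start > window_s
        if is_first:
            window_start = t
        flagged.append((tag, is_first))
    return flagged

# Cascade: AL_001 trips, three consequences follow 10 ms later,
# and an unrelated alarm arrives several minutes afterwards.
events = [
    ("AL_002", 100.01),
    ("AL_003", 100.01),
    ("AL_004", 100.01),
    ("AL_001", 100.0),
    ("AL_006", 500.0),
]

for tag, first in mark_first_out(events):
    print(tag, first)
# AL_001 True
# AL_002 False
# AL_003 False
# AL_004 False
# AL_006 True
```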


Vendor reference

The cross-vendor reference for alarm management is ANSI/ISA-18.2, Management of Alarm Systems for the Process Industries, which formalises the priority-Pareto target, the flood threshold, the deadband and on-delay recommendations, and the lifecycle from rationalisation through implementation to KPI monitoring. It is a paid standard but the executive summary is free on the ISA site and covers the rules above. The alarm subsystems in FactoryTalk Alarms & Events, TIA Portal WinCC, AVEVA System Platform, and Ignition Alarm Notification broadly follow the 18.2 model, with vendor-specific naming on the configuration screens. Read the standard once. The lifecycle vocabulary becomes the language you use in alarm-rationalisation workshops with the production team.

What we don't claim

This site is not SAQA-registered, not MerSETA-accredited, and not an NQF-registered qualification provider. Our completion certificates are course-level only — they describe what you covered, not an NQF Level X qualification. The CCST cert from ISA is the portable industry credential we recommend; we are not an ISA cert delivery partner either, but our cert packs are CCST-aligned. Alarm management is covered at a high level on the CCST Level 2 syllabus and we point at ISA 18.2 as the canonical reference for any deeper rationalisation work.