Fault-finding workflows: cause in under 30 minutes

Fault-finding workflows for SA engineers — a structured 30-minute diagnostic on a packaging line trip with bisection, logs, and root-cause discipline.

Fault-finding workflows are the part of the controls job that decides whether a Friday-afternoon callout ends at 17:30 or at 22:00. The packaging line trips. The operator says it does so every twenty minutes or so. The maintenance fitter has already swapped a sensor and the trip came back. Production wants the line running for the night shift. The next thirty minutes are the difference between a real root cause and a parts-swap exercise that hides the fault somewhere worse. This tutorial walks the structured diagnostic order (gather, reproduce, bisect, narrow, root cause) with a time budget for each step and a specific simulator scenario you can run yourself.

Why this matters on real plants

Most fault-finding on an SA brownfield plant happens under time pressure with incomplete information. The line is down. The shift handover is in two hours. The drawings on the wall do not match the panel because somebody added a remote IO drop in 2019 and never updated the print. The maintenance team has already changed one component and is itching to change another. The temptation is to keep swapping parts because each swap is a visible action that makes the production manager feel like progress is happening. The cost of giving in to that temptation is a fault that gets buried: the original cause is still there, but now there are two new components in the loop and the symptom has shifted in a way that makes the next diagnosis harder.

A structured workflow is what stops the parts-swap death spiral. Thirty minutes spent on a disciplined gather-reproduce-bisect-narrow-root sequence beats four hours of guessing every time. Examples from the plants we have walked: an intermittent line trip that gets blamed on a sensor and ends up being a control panel running too hot, a randomly-resetting drive that gets blamed on the drive and ends up being a shared-neutral wiring issue, a SCADA driver that gets blamed for a comms fault and ends up being a switch port doing spanning-tree convergence every twenty minutes. None of those were the first guess. All of them were the third or fourth guess, and the structured workflow would have reached them in the first thirty minutes.

The third reason it matters more in SA than in textbook examples: the documentation gap. Plants that were commissioned in 1998 and modified continuously since then have drawing sets where roughly half the tags on the panel match the print and half do not. Fault-finding without a structured workflow on an undocumented plant turns into archaeology. Fault-finding with a structured workflow turns into engineering, because the bisection step gives you a way to interrogate the plant directly without trusting the print.

The mental model

Every intermittent fault sorts into one of four categories, and the order you eliminate them in is the difference between thirty minutes and three hours. The categories are: electrical (a wire, a connector, a contact, a power-supply rail), controls (a PLC input, a programme bug, a comms drop, an HMI interlock), environmental (heat, vibration, EMI, condensation), or load-side (the actual mechanical or process equipment doing what the controls told it to do, but in a way that exposes a marginal design).

The workflow has five steps: five minutes each for gather, reproduce, narrow, and root cause, and ten for bisect, which adds up to thirty. Gather is the symptom-collection step: what does the operator see, what was the last thing changed on the plant, what is the time pattern of the trip, what is the alarm log saying. Reproduce is the does-it-still-fault step: run the line under your watch, with a stopwatch, and confirm the symptom matches what the operator described. Bisect is the which-half step: split the system into two halves (electrical versus controls, say), ask which half the fault lives in, then split that half again. Narrow is the which-signal step: once you know the half, identify the specific signal or component that is misbehaving. Root cause is the why-does-this-signal-misbehave step: the component is the proximate cause; the root cause is what made the component fail.

Three disciplines hold the workflow together. The first is to do one step at a time, time each step, and write down what you found before moving on. The second is to change one thing at a time: the bisection step fails the moment you change two things at once, because you lose the ability to attribute cause. Change one thing, observe, write it down, decide the next change. The third is the log: every action you take, every reading you take, and every theory you discard gets a line in the shift log. The next engineer either finds the fault from your notes or repeats half your work.
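The log discipline can be sketched as a small data structure. This is a minimal Python sketch, not part of the simulator: `LogEntry`, `ShiftLog`, the field names, and the example date are all hypothetical, chosen only to mirror the timestamp, action, and observation that each log line should carry.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    step: str          # gather, reproduce, bisect, narrow, root_cause
    action: str        # what was done
    observation: str   # what was seen afterwards


@dataclass
class ShiftLog:
    entries: list = field(default_factory=list)

    def record(self, timestamp, step, action, observation):
        # An action without an observation cannot be used to attribute
        # cause later, so refuse incomplete entries.
        if not action.strip() or not observation.strip():
            raise ValueError("log entry needs both an action and an observation")
        entry = LogEntry(timestamp, step, action, observation)
        self.entries.append(entry)
        return entry

    def render(self):
        # One line per entry, in the order the work was done.
        return [
            f"{e.timestamp:%H:%M} [{e.step}] {e.action} -> {e.observation}"
            for e in self.entries
        ]


# Hypothetical entry; the date and wording are illustrative only.
log = ShiftLog()
log.record(
    datetime(2024, 1, 5, 14, 32),
    "bisect",
    "moved LBL_HOME_LS from DI16 ch 0 to ch 8",
    "fault still present at 14:51",
)
lines = log.render()
```

Refusing an entry with no observation is a design choice in this sketch: it forces the "change one thing, observe, write it down" rhythm rather than letting actions pile up unrecorded.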

Worked example

Open the simulator. Drop a CompactLogix CPU on the rack with a DI16, a DO16, and an AI8 module. Load the packaging-line scenario — a labeller, a capper, an inspection camera, and a reject pusher driven by a CompactLogix programme on EtherNet/IP. The simulator's fault-injection panel is set to "intermittent line trip every 20 minutes ± 3". Run the line. Within twenty-five minutes the line trips and the operator-view HMI shows a generic "Line Stopped — see fault log" message. The clock starts.

Minute 0 to 5 — gather. Read the alarm log on the simulator's HMI. The log shows two entries within a 200 ms window before the trip: LBL_HOME_LS lost (the labeller home limit switch went from 1 to 0) followed by E_STOP_SCAN fault (the safety scan saw an unexpected discrete-input transition). The maintenance log shows the labeller home limit switch was replaced two weeks ago. The operator confirms the trip happens about every twenty minutes and is not correlated with a specific bottle size or product changeover. Time check: 4 minutes 30. Notes written.
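The "which alarm came first" question in the gather step can be sketched as a filter over the alarm log. This is an illustrative Python sketch under assumptions: the log format (milliseconds since line start, tag name), the timestamps, and the `CAP_TORQUE_WARN` tag are hypothetical and do not come from the simulator.

```python
def alarms_before_trip(alarm_log, trip_ms, window_ms=200):
    """Return the alarms inside the window before the trip, oldest first."""
    hits = [
        (t, tag) for t, tag in alarm_log
        if trip_ms - window_ms <= t <= trip_ms
    ]
    return sorted(hits)


# Hypothetical log entries: (milliseconds since line start, alarm tag).
alarm_log = [
    (12_000, "CAP_TORQUE_WARN"),     # old, unrelated entry
    (1_260_130, "E_STOP_SCAN"),
    (1_260_050, "LBL_HOME_LS"),
]
trip_ms = 1_260_140

sequence = alarms_before_trip(alarm_log, trip_ms)
first_out = sequence[0][1]   # the earliest alarm in the window
```

The earliest alarm in the window is the one worth trending in the reproduce step; later alarms in the same window are often consequences rather than causes.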

Minute 5 to 10: reproduce. Run the line and watch the LBL_HOME_LS tag in the simulator's tag-monitor view. Set a sample period of 10 ms or faster and a 30-minute trend; a one-second sample period would step straight over a sub-100 ms event. At minute 8 of run-time the trend shows the LBL_HOME_LS tag drop to 0 for 80 ms, then return to 1, and the line trips on the safety scan a tick later. The drop is real. The fault is reproducible. The drop is short (under 100 ms) and clean. That clean shape points to electrical (a glitch on the input) rather than mechanical (a switch that physically loses contact, which would show a noisier transition). Time check: 10 minutes. Hypothesis written: "electrical glitch on LBL_HOME_LS input".
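A sampled trend can only show a pulse that spans at least one sample, so the sample period needs to be well under the glitch width. The Python sketch below measures low pulses in a sampled 0/1 trend and compares 10 ms and 1 s sampling on a synthetic signal. The signal, its timing, and the function names are assumptions for illustration, not simulator output.

```python
def low_pulses(samples, period_ms):
    """Return (start_ms, width_ms) for each run of 0s in a sampled 0/1 trend."""
    pulses, start = [], None
    for i, value in enumerate(samples):
        if value == 0 and start is None:
            start = i                      # falling edge
        elif value == 1 and start is not None:
            pulses.append((start * period_ms, (i - start) * period_ms))
            start = None                   # rising edge closes the pulse
    if start is not None:                  # trend ended while still low
        pulses.append((start * period_ms, (len(samples) - start) * period_ms))
    return pulses


def sample(signal_fn, duration_ms, period_ms):
    """Sample signal_fn every period_ms from 0 up to duration_ms."""
    return [signal_fn(t) for t in range(0, duration_ms, period_ms)]


# Synthetic signal: high except for an 80 ms drop starting at 1250 ms.
def signal(t_ms):
    return 0 if 1250 <= t_ms < 1330 else 1


fast = low_pulses(sample(signal, 3000, 10), 10)       # 10 ms sampling
slow = low_pulses(sample(signal, 3000, 1000), 1000)   # 1 s sampling
```

With 10 ms sampling the drop is captured with its width; with 1 s sampling no sample lands inside the drop and the trend looks clean.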

Minute 10 to 20: bisect. Two halves: the input wiring (from the limit switch to the DI16 terminal) versus the input module itself (the DI16 channel, the 24 V backplane supply, the CPU's input image table). Bisect by moving the LBL_HOME_LS signal to a spare, known-good DI16 channel. The simulator's panel-walk view lets you re-terminate inputs. Move LBL_HOME_LS from DI16 channel 0 to channel 8, update the I/O tag mapping, download. Run for twenty minutes. The trend shows the same 80 ms drop, same time pattern, same trip. The input channel is not the fault: the fault rides with the wire. Check the trend for the other DI16 inputs as well; if they hold steady through the drop, the module and its supply are not the common factor either. The wiring half of the system is now the suspect half. Time check: 18 minutes. Note: "DI16 module ruled out by channel-swap test".
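The logic of the channel-swap test can be written down as a small decision function. This is a hedged Python sketch: `Suspect`, `channel_swap_verdict`, and its arguments are hypothetical names, and the function encodes only the reasoning in the text (one change per cycle, then see whether the fault follows the wire).

```python
from enum import Enum


class Suspect(Enum):
    WIRING = "field wiring and device"
    INPUT_CHANNEL = "original input channel"
    INCONCLUSIVE = "inconclusive"


def channel_swap_verdict(changes_made, fault_seen_on_new_channel,
                         observed_full_interval):
    """Decide which half holds the fault after a channel-swap test."""
    # More than one change (or none) breaks attribution of cause.
    if len(changes_made) != 1:
        return Suspect.INCONCLUSIVE
    # A fault-free run shorter than the trip interval proves nothing.
    if not fault_seen_on_new_channel and not observed_full_interval:
        return Suspect.INCONCLUSIVE
    # The fault follows the signal to the new channel: suspect the wire side.
    if fault_seen_on_new_channel:
        return Suspect.WIRING
    return Suspect.INPUT_CHANNEL


verdict = channel_swap_verdict(
    ["moved LBL_HOME_LS from ch 0 to ch 8"],
    fault_seen_on_new_channel=True,
    observed_full_interval=True,
)
```

The two inconclusive branches are the point of the sketch: a test with two changes, or a clean run that was too short to include a trip, should not be logged as ruling anything out.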

Minute 20 to 25: narrow. Walk the wire from the DI16 terminal back to the limit switch. The simulator's panel-walk view shows the cable runs through a 230 V cable trunk for about 1.5 metres before exiting to the field. The cable is a basic 24 V signal wire with no shield. The trunk has a contactor on the same run that pulls in every twenty minutes when the labeller indexes the bottle carousel, and the contactor's coil collapse sends an EMI pulse down the trunk that capacitively couples into the unshielded signal wire. Trending the contactor's output alongside LBL_HOME_LS in the tag-monitor view is the quick confirmation: the drop should line up with the contactor switching. The 80 ms drop is consistent with the disturbance around that switching event. Time check: 24 minutes. Root cause hypothesis: "EMI from labeller indexer contactor coupling into unshielded LBL_HOME_LS signal in shared trunk".
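One way to test a coupling hypothesis like this is to check whether the glitch timestamps line up with the switching times of the suspected source. The Python sketch below is illustrative only: the timestamps are invented, the 50 ms tolerance is an assumption, and `fraction_explained` is a hypothetical helper, not a simulator function.

```python
def fraction_explained(glitch_ms, event_ms, tolerance_ms=50):
    """Share of glitches that start within tolerance_ms of some event."""
    if not glitch_ms:
        return 0.0
    matched = sum(
        1 for g in glitch_ms
        if any(abs(g - e) <= tolerance_ms for e in event_ms)
    )
    return matched / len(glitch_ms)


# Invented timestamps in milliseconds since line start.
glitches = [1_250_000, 2_431_000, 3_702_000]
contactor_events = [1_249_990, 2_430_985, 3_701_995]   # suspected source
capper_events = [600_000, 1_800_000, 3_000_000]        # unrelated device

by_contactor = fraction_explained(glitches, contactor_events)
by_capper = fraction_explained(glitches, capper_events)
```

A source that explains every glitch and a source that explains none is the clean case. On a real plant the numbers are usually messier, and a partial match is a reason to keep looking rather than a confirmation.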

Minute 25 to 30: root cause and fix path. The proximate cause is the contactor coil collapse. The root cause is the wiring decision made when the panel was modified in 2019 to share the trunk between the 230 V indexer feed and the 24 V signal wires. The permanent fix is to re-route the LBL_HOME_LS signal in a separate conduit, or to install a shielded twisted pair with the shield grounded at the panel end. The temporary fix, to get the line running for the night shift, is an input filter on the DI16 channel for LBL_HOME_LS. The filter time has to be longer than the glitch: a 50 ms filter would still pass an 80 ms drop, so set it to the next available value above 80 ms (100 ms, for example) and confirm that the added delay on a genuine home-switch transition is acceptable for the labeller. The simulator's I/O configuration tree exposes the filter setting. Apply the filter, download, run. The line runs the rest of the shift without a trip. Permanent fix scheduled for the next maintenance window. Time check: 30 minutes. Cause found, temporary fix in place, permanent fix scoped.
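A debounce-style input filter only rejects pulses shorter than its filter time, which is worth checking before relying on it. The Python sketch below models a simple debounce at 1 ms resolution and applies it to a synthetic 80 ms drop. It is an assumption-laden illustration: real input modules implement filtering in their own way, and `filtered` is a hypothetical function, not a vendor API.

```python
def filtered(samples, filter_ms, step_ms=1):
    """Simple debounce: the output follows the input only after the input
    has held a new value for at least filter_ms."""
    out = []
    state = samples[0]          # current filtered output
    changed_at = None           # index where the input first differed
    for i, value in enumerate(samples):
        if value == state:
            changed_at = None   # input went back, restart the timer
        else:
            if changed_at is None:
                changed_at = i
            if (i - changed_at + 1) * step_ms >= filter_ms:
                state, changed_at = value, None
        out.append(state)
    return out


# Synthetic input at 1 ms per sample: high, an 80 ms drop, high again.
raw = [1] * 500 + [0] * 80 + [1] * 500

passes_50 = 0 in filtered(raw, 50)     # does the drop get through?
passes_100 = 0 in filtered(raw, 100)
```

In this model the 50 ms filter still lets the 80 ms drop through (delayed by 50 ms), while the 100 ms filter rejects it. The same delay applies to genuine transitions, which is the price of the temporary fix.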

The diagnostic checklist pseudo-code is what you actually run through:

(* Fault-finding workflow — 30-minute target *)

STEP_1_GATHER (5 min):
    Read alarm log    -> identify last-trip alarm sequence
    Read shift log    -> any recent maintenance changes
    Talk to operator  -> time pattern, product correlation
    Output: hypothesis category {electrical | controls | env | load}

STEP_2_REPRODUCE (5 min):
    Run line under watch, stopwatch on
    Trend the suspect signal fast enough to resolve the glitch (10 ms sample or change-of-state capture)
    Output: confirmed reproducible? Y/N. Glitch shape (ms, edges)

STEP_3_BISECT (10 min):
    Choose split: wiring vs module, or program vs IO, or comms vs app
    Swap one variable only
    Run again, observe
    Output: which half holds the fault. Other half ruled out.

STEP_4_NARROW (5 min):
    Within suspect half, walk the signal physically
    Look for: shared trunk, vibration source, heat source, age
    Output: specific component or wire identified

STEP_5_ROOT_CAUSE (5 min):
    Why did this component or wire fail now?
    Permanent fix scope, temporary fix path
    Output: fix decision + log entry

LOG_DISCIPLINE: every step writes one line minimum to the shift log
                with timestamp, action, and observation.

The structure works on every fault category, not just electrical. A controls fault gets bisected as program-vs-IO. A comms fault gets bisected as physical-vs-addressing (see the communication troubleshooting tutorial). An environmental fault gets bisected as inside-the-panel vs outside-the-panel. The categories change; the five-step rhythm stays the same.

Common mistakes

Changing two things at once and losing bisection. The moment you swap the limit switch and the DI16 module on the same trip to the field, you cannot tell which one fixed it (or whether neither did and the symptom shifted). One change per cycle. Observe. Write it down. Decide the next change. The discipline feels slow on the first fault and saves three hours on the second.
Assuming the last-changed component without checking. The maintenance log says the limit switch was changed two weeks ago, so the limit switch must be the fault. Probably not. Recently-changed components are worth checking, but they are not worth assuming. Run the reproduce step first and let the trend chart tell you whether the signal is misbehaving electrically or mechanically. The last-changed-thing bias has cost more swap-cycles than any other cognitive trap on the SA plants we have walked.
Blaming software when the panel temperature is the cause. Intermittent faults that get worse in the afternoon and clean up overnight are very often thermal. CompactLogix, S7-1500, and similar controllers are typically rated for a surrounding air temperature of around 60 °C at most, and less in some mounting orientations, so check the datasheet for the exact limit. A panel running near or above that limit produces intermittent comms drops or random IO faults that look like a software bug. Check the panel internal temperature before debugging the program. A cheap stick-on thermal label or an infrared spot reader takes thirty seconds.
No log entry of what was tried. A fault that stopped reproducing during your shift but came back on the next shift is a fault the next engineer has to re-diagnose from scratch unless your shift log has every reading and every test. Two-line shift-log entries — "swapped channel 0 to channel 8 at 14:32, fault still present at 14:51, ruled out DI16 module" — are what turns intermittent-fault chasing from individual heroics into a team workflow. Without them, every shift starts at minute zero of a new diagnostic.
Skipping the reproduce step on a busy line. The temptation when production is breathing down your neck is to skip past reproduce and go straight to bisection because "the operator said it trips every twenty minutes". The operator's twenty minutes is rarely the same as the trend chart's twenty minutes — and the difference matters. Reproduce confirms the symptom shape (clean glitch versus noisy degradation), the time pattern (every twenty minutes versus every two hundred cycles versus every contactor pull-in), and the alarm correlation. Without reproduce, the next four steps have nothing to bisect against.
Treating the temporary fix as the permanent fix. An input filter longer than the glitch masks the EMI glitch on LBL_HOME_LS, but the contactor coil collapse is still happening every twenty minutes and the unshielded wire in the shared trunk is still picking it up. The day a different signal in the same trunk turns out to be more sensitive, or the day the contactor's coil suppression diode fails and the pulse gets bigger, the fault returns and the input filter is no longer enough. Always log the temporary fix as temporary and schedule the permanent fix at the next maintenance window. Plenty of "intermittent" plant faults are just temporary fixes that nobody scheduled to make permanent.
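The gap between the operator's "every twenty minutes" and what the trend actually shows, raised under the reproduce-step mistake above, can be put in numbers by summarising the intervals between logged trips. A minimal Python sketch follows; the trip times are invented and `summarise_intervals` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarise_intervals(trip_times_min):
    """Summarise the gaps between consecutive trip times (in minutes)."""
    gaps = [b - a for a, b in zip(trip_times_min, trip_times_min[1:])]
    if not gaps:
        raise ValueError("need at least two trips to measure an interval")
    return {
        "gaps": gaps,
        "mean": mean(gaps),
        "spread": pstdev(gaps),   # population standard deviation
        "min": min(gaps),
        "max": max(gaps),
    }


# Invented trip times, minutes since the start of the shift.
trips = [0, 19, 41, 60, 82]
summary = summarise_intervals(trips)
```

A tight spread around a fixed interval suggests a timed or cyclic source. A wide spread suggests the trigger is counted in cycles, load, or temperature rather than in minutes.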

How to practise this in the simulator

The simulator's packaging-line scenario has a fault-injection menu with thirty different intermittent faults across the four categories — EMI coupling on signal wires, hot panel internals, marginal cable splices, IP collisions on the management LAN, programme bugs that fire on a rare interlock. Each fault has a specific time pattern, a specific symptom shape, and a specific bisection answer. Run the workflow on five different faults in a row, time each step, and keep a shift log. After ten faults the rhythm becomes automatic — gather in five, reproduce in five, bisect in ten, narrow in five, root cause in five — and the thirty-minute target stops being a stretch and starts being a habit.

Vendor reference

The cross-vendor reference for a structured failure-cause approach is the Wikipedia article on failure mode and effects analysis (FMEA), which covers the discipline of mapping cause to effect and ranking failure modes by severity, occurrence, and detectability. FMEA is the design-time equivalent of the runtime workflow above, and reading one informs the other. The ISA training and certification syllabus has the CCST troubleshooting modules that codify this kind of workflow at the certification level, and the diagnostic tooling in Studio 5000 and TIA Portal exposes the bisection-friendly signals (input filters, channel-level diagnostics, controller fault logs) that the workflow above relies on.

What we don't claim

This site is not SAQA-registered, not MerSETA-accredited, and not an NQF-registered qualification provider. Our completion certificates are course-level only — they describe what you covered, not an NQF Level X qualification. The CCST cert from ISA is the portable industry credential we recommend; we are not an ISA cert delivery partner either, but our cert packs are CCST-aligned. Fault-finding speed is a skill that builds with reps on real and simulated faults, not with a certificate alone.