
How Operations Teams Are Using Sensor Data to Make Better Maintenance Decisions

Priya Nair

November 2, 2024

There's a gap that develops in most IoT deployments for commercial refrigeration fleets: sensors get installed, data starts flowing, and then the data sits. A dashboard somewhere shows live readings, a technician glances at it occasionally, and the system is considered "working." Months later, a compressor fails — and a review of the sensor logs shows clearly that the degradation trend was visible six weeks earlier.

The problem isn't data collection. The problem is the pipeline from raw sensor readings to a specific maintenance decision made by a specific person on a specific timeline. That pipeline has multiple failure points, and understanding where it breaks down is more useful than adding more sensors.

Stage 1: Raw Ingestion and Normalization

Commercial refrigeration sensors produce data in several formats depending on vendor and installation vintage. Older units often use RS-485 or Modbus RTU protocols with proprietary register maps. Newer units communicate via BACnet or JSON-over-MQTT to cloud endpoints. Sensor hardware from different manufacturers uses different engineering units, different sampling rates, and different timestamp formats.

Before any analysis happens, raw data needs to be normalized into a consistent schema: each reading associated with a unit identifier, a sensor type, a value in a known unit, and a reliable timestamp. Timestamp reliability is frequently underestimated as a problem. A compressor controller whose clock has drifted 14 minutes from true time misaligns its pressure readings against vibration readings from other devices, so multi-sensor analysis sees correlations between events that did not actually coincide and misses the ones that did. The result is false detections and missed patterns.

A practical normalized record for a refrigeration unit reading looks roughly like:

{
  "unit_id": "CF-14",
  "sensor": "vibration_rms",
  "value": 0.48,
  "unit": "g",
  "timestamp": "2024-10-28T14:32:07Z",
  "quality": "good"
}

The quality field matters. Sensor hardware fails, communication links drop, and values occasionally read out of physical range (a pressure transducer that returns -9999 because the Modbus read failed, for example). A data quality flag at ingestion prevents bad readings from propagating into analysis stages where they would generate false alerts.
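As a rough illustration of quality flagging at ingestion, the sketch below builds a normalized record and assigns a flag. The sentinel value, the physical ranges, the `clock_offset_s` parameter, and the `"bad"` and `"suspect"` labels are assumptions made for the example; real values depend on the vendor register map and the sensor datasheets.

```python
from datetime import datetime, timezone

# Assumed sentinel and physical ranges, for illustration only.
SENTINELS = {-9999.0}
PHYSICAL_RANGE = {
    "suction_pressure": (0.0, 200.0),  # psig, assumed range
    "vibration_rms": (0.0, 10.0),      # g, assumed range
}

def normalize(unit_id, sensor, raw_value, unit, epoch_seconds, clock_offset_s=0.0):
    """Build a normalized record and attach a quality flag at ingestion."""
    # Subtract a known controller clock offset, then format as UTC ISO 8601.
    ts = datetime.fromtimestamp(epoch_seconds - clock_offset_s, tz=timezone.utc)
    low, high = PHYSICAL_RANGE.get(sensor, (float("-inf"), float("inf")))
    if raw_value is None or raw_value in SENTINELS:
        quality = "bad"        # failed read, e.g. a Modbus error sentinel
    elif not (low <= raw_value <= high):
        quality = "suspect"    # outside the assumed physical range
    else:
        quality = "good"
    return {
        "unit_id": unit_id,
        "sensor": sensor,
        "value": raw_value,
        "unit": unit,
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "quality": quality,
    }
```

Downstream stages can then filter on `quality == "good"` before computing baselines or evaluating anomalies.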

Stage 2: Baseline Establishment

Raw readings have no meaning without context. A vibration reading of 0.48g on a compressor running at 3,450 RPM could be healthy or alarming depending on the unit's history, its mounting configuration, and what it read last week. Absolute thresholds — "alert if vibration exceeds 0.7g" — generate enormous false positive rates because unit-to-unit variation in healthy operation is substantial.

What actually works for detecting degradation is comparison against a unit-specific baseline established during a known-healthy operating period. For each unit, the baseline captures the expected distribution of readings across operating states (startup, steady-state, shutdown, defrost cycle) for each sensor type. An anomaly detection system then evaluates current readings against that unit's own baseline — not against a fleet average and not against a static threshold.

Baseline establishment requires a calibration period, typically 4–8 weeks of confirmed-healthy operation, during which you build a statistical model of what normal looks like for that specific unit under its specific operating conditions. This is the stage most teams skip because it requires patience and discipline. The consequence of skipping it is a detection system calibrated to fleet averages: it raises alerts on healthy units that simply run differently from average, and it misses failures on units whose degradation is slow enough to stay inside the fleet's normal range.
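One simple way to represent such a baseline is a mean and standard deviation per operating state, with current readings scored against them. The sketch below assumes that representation; the function names and the choice of a z-score are illustrative, and a production model might use percentiles or a richer distribution instead.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(readings):
    """Summarize calibration-period readings for one unit and one sensor.

    readings: iterable of (operating_state, value) pairs.
    Returns {state: (mean, standard_deviation)}.
    """
    by_state = defaultdict(list)
    for state, value in readings:
        by_state[state].append(value)
    # stdev needs at least two samples, so sparser states are skipped.
    return {s: (mean(v), stdev(v)) for s, v in by_state.items() if len(v) >= 2}

def deviation(baseline, state, value):
    """Z-score of a reading against the unit's own baseline for that state.

    Returns None when the state has no baseline or the spread is zero.
    """
    if state not in baseline:
        return None
    mu, sigma = baseline[state]
    if sigma == 0:
        return None
    return (value - mu) / sigma
```

Keying the baseline by operating state keeps a startup vibration spike from being scored against steady-state statistics.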

Stage 3: Anomaly Detection with Persistence Logic

Once baselines are established, anomaly detection is the process of identifying readings or patterns that fall outside the expected distribution for a given unit and sensor combination. The challenge is distinguishing real degradation trends from operational noise.

Commercial refrigeration generates legitimate short-term sensor excursions constantly: defrost cycles cause temporary temperature and pressure swings, compressor startups cause vibration spikes, and high ambient temperatures temporarily elevate discharge temperatures. An alert system that triggers on any single reading outside the expected range will page someone every few hours and train the operations team to ignore alerts.

The solution is persistence logic: requiring that an anomaly condition persist across multiple consecutive readings (or maintain a pattern over a defined time window) before generating an alert. A vibration reading 20% above baseline on a single sample is noise. A vibration trend 18–22% above baseline consistently over 5 days, in the specific frequency band associated with bearing wear, is a signal that warrants a maintenance work order.

The specific persistence parameters — how many readings, over what time window, in which frequency bands — are the most important tuning parameters in a condition monitoring system for refrigeration. Too loose and the system generates more alerts than anyone can act on. Too tight and it misses the early warning signals that have enough lead time to enable planned maintenance.
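A minimal sketch of persistence logic, assuming readings have already been reduced to one value per day for the frequency band of interest. The 15% excess and 5-day window are placeholder tuning values, not recommendations.

```python
def trailing_run(values, threshold):
    """Count how many of the most recent values in a row exceed threshold."""
    run = 0
    for v in reversed(values):
        if v > threshold:
            run += 1
        else:
            break
    return run

def should_alert(daily_means, baseline_mean, rel_excess=0.15, min_days=5):
    """Alert only when the excess over baseline has persisted for min_days."""
    threshold = baseline_mean * (1 + rel_excess)
    return trailing_run(daily_means, threshold) >= min_days
```

With a baseline of 0.40 g, a single day at 0.60 g does not alert, while five consecutive days near 0.48 g do. Raising `min_days` or `rel_excess` trades earlier warning for fewer alerts.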

Stage 4: Failure Mode Classification

An alert that says "compressor anomaly on unit CF-14" is marginally better than no alert. An alert that says "vibration elevation consistent with early bearing wear, unit CF-14, 6-day trend, recommended action: bearing inspection and replacement within 3 weeks" is actionable.

Failure mode classification — mapping a specific pattern of sensor anomalies to a likely failure mode — is what converts a condition signal into a useful maintenance action. Different failure modes produce different signatures:

Bearing wear: Elevated mid-frequency vibration (60–350 Hz), often with harmonics and sidebands of BPFO or BPFI (the ball pass frequency of the outer or inner race), and no immediate change in pressure or temperature.
Refrigerant leak: Gradual suction pressure decline, rising superheat (suction line temperature minus saturation temperature at suction pressure), declining discharge pressure. Vibration typically normal until refrigerant loss becomes severe.
Scroll tip wear: Declining volumetric efficiency (discharge pressure lower than expected for the given suction pressure and ambient temperature), elevated discharge temperature, broadband vibration elevation in the 500–1,000 Hz range.
Liquid slugging: High-amplitude, low-frequency vibration spikes during startup, often coinciding with elevated suction pressure, indicating liquid refrigerant entering the compressor.

Classification improves over time as you accumulate failure cases with confirmed post-repair diagnoses — each confirmed failure teaches the system which sensor patterns preceded that specific failure mode in your fleet's units.
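Before enough labeled failures exist to train anything, the signatures listed above can be encoded as plain rules. The sketch below does that with hypothetical flag names; each flag is assumed to come from an upstream persistence check, and the rule table is a starting point rather than a validated classifier.

```python
def superheat(suction_line_temp, saturation_temp_at_suction_pressure):
    """Superheat as defined above: suction line temperature minus saturation temperature."""
    return suction_line_temp - saturation_temp_at_suction_pressure

# Hypothetical flag names; each is set upstream when a trend has persisted.
PRESSURE_TEMP_FLAGS = {
    "suction_pressure_falling", "discharge_pressure_falling",
    "discharge_pressure_low", "discharge_temp_high", "superheat_rising",
}

# (failure mode, flags that must be present, flags that must be absent)
SIGNATURES = [
    ("bearing_wear", {"vibration_60_350hz_high"}, PRESSURE_TEMP_FLAGS),
    ("refrigerant_leak",
     {"suction_pressure_falling", "superheat_rising", "discharge_pressure_falling"}, set()),
    ("scroll_tip_wear",
     {"discharge_pressure_low", "discharge_temp_high", "vibration_500_1000hz_high"}, set()),
    ("liquid_slugging",
     {"startup_lf_vibration_spikes", "startup_suction_pressure_high"}, set()),
]

def classify(active_flags):
    """Return every failure mode whose signature matches the active flags."""
    flags = set(active_flags)
    return [mode for mode, required, excluded in SIGNATURES
            if required <= flags and not (excluded & flags)]
```

An empty result means the anomaly matches no known signature, which is itself useful information to attach to the alert.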

Stage 5: Work Order Creation and Dispatch Integration

The most common place the signal-to-action pipeline breaks down is the handoff from anomaly detection to maintenance dispatch. Detection systems generate alerts. Someone needs to evaluate the alert, decide it warrants action, create a work order, assign it to a technician, and communicate the context. If each of those steps requires manual intervention every time, alerts accumulate in a queue, and how quickly anyone acts depends on how often someone checks that queue.

The goal is direct work order generation from confirmed anomaly alerts — with the failure mode classification, trend duration, recommended parts, and urgency level pre-populated in the work order. The dispatcher's job becomes reviewing auto-generated work orders against the dispatch schedule, not evaluating raw sensor data and deciding whether it warrants action.

We're not saying that automatic work order generation removes the need for human judgment. There are always cases where context matters and automation gets it wrong. The point is that the default path for a confirmed, persistent anomaly should be automatic work order creation, with human override available. Requiring a human decision at every step creates a bottleneck, and work orders then tend to arrive too late to enable planned maintenance.
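The default path described above might look like the following sketch. The `PLAYBOOK` table, the status strings, and the part name are hypothetical; the bearing entry mirrors the example alert in Stage 4, and a real table would come from the fleet's own maintenance standards and dispatch system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical playbook keyed by failure mode.
PLAYBOOK = {
    "bearing_wear": {
        "action": "bearing inspection and replacement",
        "parts": ["compressor bearing kit"],
        "due_days": 21,  # "within 3 weeks"
    },
}

@dataclass
class WorkOrder:
    unit_id: str
    failure_mode: str
    trend_days: int
    action: str
    parts: List[str]
    due: Optional[date]
    status: str

def create_work_order(alert: dict, today: date) -> WorkOrder:
    """Turn a confirmed, persistent anomaly alert into a pre-populated work order."""
    entry = PLAYBOOK.get(alert["failure_mode"])
    if entry is None:
        # No playbook entry: leave the decision to a person.
        return WorkOrder(alert["unit_id"], alert["failure_mode"], alert["trend_days"],
                         "manual evaluation", [], None, "needs_manual_evaluation")
    return WorkOrder(
        unit_id=alert["unit_id"],
        failure_mode=alert["failure_mode"],
        trend_days=alert["trend_days"],
        action=entry["action"],
        parts=list(entry["parts"]),
        due=today + timedelta(days=entry["due_days"]),
        status="pending_dispatch_review",  # dispatcher can still override
    )
```

The order is created automatically, but its status keeps the dispatcher in the loop for scheduling and override.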

Closing the Loop: Post-Repair Data Feedback

A signal-to-action pipeline that doesn't receive feedback from completed repairs is operating open-loop. When a technician completes a repair, the outcome — what was actually found, what was replaced, what the confirmed failure mode was — should feed back into the anomaly detection model as a labeled training example.

In practical terms, this means work order completion requires the technician to record: the confirmed diagnosis, the parts replaced, whether the anomaly pattern correctly predicted the failure mode, and the unit's post-repair sensor baseline (to reset the baseline after a significant repair). Over time, this feedback loop improves failure mode classification accuracy for your specific fleet and operating environment.
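A sketch of what that completion record could look like as a labeled example, with a simple accuracy measure over completed repairs. The field names are assumptions chosen to match the items listed above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RepairFeedback:
    unit_id: str
    predicted_mode: str       # failure mode named on the work order
    confirmed_mode: str       # technician's confirmed diagnosis
    parts_replaced: List[str]
    reset_baseline: bool      # True after a significant repair

    @property
    def prediction_correct(self) -> bool:
        return self.predicted_mode == self.confirmed_mode

def classification_accuracy(feedback: List[RepairFeedback]) -> Optional[float]:
    """Share of completed repairs where the predicted mode matched the diagnosis."""
    if not feedback:
        return None
    return sum(f.prediction_correct for f in feedback) / len(feedback)
```

Tracking this figure per failure mode over time shows whether the classification rules are actually improving for the fleet.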
