Nowadays, machine learning methods and data-driven models are widely used in different fields, including computer vision, biomedicine, and condition monitoring. However, these models show performance degradation when confronted with real-life situations. Domain or dataset shift, also referred to as out-of-distribution (OOD) prediction, is commonly cited as the reason for this problem. Especially in industrial condition monitoring, it is not clear when we should be concerned about domain shift and which methods are more robust against it. In this paper, prediction results are compared for a conventional machine learning workflow based on feature extraction, selection, and classification/regression (FESC/R) and for deep neural networks on two publicly available industrial datasets. We show that it is possible to visualize a possible domain shift using feature extraction and principal component analysis. Furthermore, the experimental comparison shows that the cross-domain validated results of FESC/R are comparable to the reported state-of-the-art methods. Finally, we show that the results for simple randomly selected validation sets do not correctly represent the model performance in real-world applications.
Zusammenfassung: Machine learning and data-driven models are widely used in the literature on computer vision, biomedicine, and condition monitoring. However, these methods often show weaknesses in real-world applications. Domain shift, or predictions outside the distribution of the training data, is frequently named as the cause. Especially in industrial condition monitoring, it is unclear when these problems occur and which algorithms are robust against them. In this contribution, the results of a classical ML processing chain consisting of feature extraction, feature selection, and classification or regression (FESC/R) are compared with those of multi-layer neural networks on two publicly available datasets. It is shown that possible data shifts can be made visible using feature extraction and principal component analysis. Furthermore, it is shown that the results achieved with FESC/R on domain shift problems are on par with those of multi-layer neural networks. Finally, it is shown that random cross-validation cannot adequately reflect the accuracy of an ML model to be expected in a real application.
Keywords: Machine learning; condition monitoring; domain adaptation; neural network; maschinelles Lernen; Zustandsüberwachung; Domänenadaption; neuronale Netze
Condition monitoring and predictive maintenance are important applications for machine learning (ML) algorithms. Input data in these applications comes from different industrial sensors, e. g., pressure, temperature, vibration, or microphones. Targets for these tasks are usually predicting fault types, remaining useful lifetime (RUL), or detecting anomalies. Detecting faults or anticipating upcoming failures can significantly reduce downtime of industrial systems and furthermore ensure the quality of products [[
The performance of modern data-driven models depends on the quality and quantity of supplied observations; however, acquiring proper data that covers all possible variations of a system and its environment to train these models is costly. A proper design of experiment (DoE) should include different control conditions and multiple recordings of a single target in different process situations and environments, e. g., for a ball bearing and an attached vibration sensor all possible combinations of temperatures, load and speed levels, lubrication conditions, vibrations transmitted by other machinery, and peculiarities of production tolerances. This is exacerbated further when taking outdoor applications into account, e. g., for hydraulic machinery, because of the wider temperature range and additional environmental factors. Usually, variables considered less important for a process or expensive to change are ignored or varied in a limited range or step size to limit experiment costs. Whether control variables are discrete or continuous, a design of experiment can cover only a limited number of them, so only subsets of the complete target space are available for training [[
Many real-life applications of ML for condition monitoring impose domain shift problems onto the algorithms and thereby decrease their performance. Supervised ML methods mostly rely on the assumption that both training and test data come from the same distribution. This distribution of data can be called a domain and ideally, there is only one domain in a supervised learning task [[
In classical measurement science, changes in the environment (in computer science: domain shifts) are tackled with calibration and adjustment of the measurement system, which is also possible for machine learning algorithms. To perform this adjustment of ML algorithms, different approaches have been proposed. The work of Moreno et al. [[
The rest of the paper is structured as follows: Section 2 first introduces a dataset from a hydraulic machine representing a regression problem and a dataset on damage detection in a ball bearing representing a classification task. Both datasets comprise domain shifts that are visualized. Furthermore, Section 2 introduces the two ML approaches compared in this study, i. e., a more classical approach based on feature extraction, feature selection and classification/regression and a more modern approach based on neural network architecture search. Section 3 shows how classification and regression results are affected by domain shifts in the mentioned datasets and how calibration and adjustment can help to compensate those effects before the study is concluded in Section 4.
In this section, we introduce the datasets and methods used in this study. The methods comprise artificial neural networks (ANNs) and FESC/R, which is based on conventional ML approaches. The two publicly available datasets are (
The first dataset used in this study is the recorded behavior of a hydraulic system (HS) where multiple common faults of such a system are simulated in a testbed [[
Figure 1 DoE of the ZeMA dataset. All possible combinations of faults were repeated three times (a) for different cooler states, 100 % (normal operation), 20 %, and 3 % performance (set by varying the duty cycle of the cooler ventilator). The combination of faults for valve, pump, and accumulator is plotted in figure (b).
Figure 2 Features from the ZeMA dataset after PCA. (a) All data samples colored by the cooler performance. (b) Subset of observations that have more similarities. Shifts in the distribution due to the cooler changes are visible.
The control variable with the biggest influence on the sensor data is the performance of the cooler. To show the influence of the process conditions on the distribution of the data, we extracted statistical features from the raw data using StatMom, which is described in Section 2.2.1. Then, Principal Component Analysis (PCA) was applied to the extracted features; the results for the first two components are shown in Figure 2. As is evident from Figure 2, the cooler has a major influence on the data distribution, and a change of the cooler performance results in a shift along the first principal component (PC), indicating the main source of variance in the dataset. Consequently, for this task the observations that belong to each cooler state can be considered as separate domains. Additionally, the cooler state is the most expensive control variable to change because after each change the machine has to run for several hours before a new temperature equilibrium is reached, and conditions are stable again [[
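The visualization described above can be sketched in a few lines. The following is an illustrative Python example with synthetic stand-in features rather than the real ZeMA recordings (feature values, dimensions, and the size of the shift are all made up), showing how a domain offset becomes visible along the first PC:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in features for two "cooler state" domains: same
# structure, but one domain is shifted in feature space (all values synthetic).
domain_a = rng.normal(0.0, 1.0, (200, 6))        # e.g. cooler at 100 %
domain_b = rng.normal(0.0, 1.0, (200, 6)) + 4.0  # e.g. cooler at 3 %
X = np.vstack([domain_a, domain_b])

# PCA via SVD on the mean-centered feature matrix
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T  # projections onto the first two PCs

# The domain offset shows up as a separation along the first PC
gap = abs(scores[:200, 0].mean() - scores[200:, 0].mean())
```

Because the between-domain offset dominates the variance, the first PC aligns with it and the two domains separate along that axis, just as in Figure 2.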
The learning scenario chosen for this dataset is the assessment of the current valve switching characteristic from 72 % (barely working) to 100 % under the condition that only data from cooler state 20 % (equivalent to 55 °C average temperature) and 100 % (equivalent to 44 °C average temperature) are used for training. Correctly predicting the valve characteristic at cooler state 3 % (equivalent to 66 °C average temperature) [[
For calibration and adjustment of the models, data recorded at 3 % cooler state (new domain) and 100 % correct valve operation was considered. This is equivalent to using few measurements from a new machine (valve at 100 %) in a different environment for calibration and adjustment. The model is then evaluated on all data at cooler state 3 %.
Table 1 Summary of CWRU dataset.
Fault types   Fault size (mil)   Load (hp)    Rotational Speed (rpm)   Sensor Orientation
No Damage     0                  0, 1, 2, 3   1725–1796                12
Inner Ring    7, 14, 21          0, 1, 2, 3   1721–1796                12
Outer Ring    7, 14, 21          0, 1, 2, 3   1723–1796                3, 6, 12
Ball          7, 14, 21          0, 1, 2, 3   1721–1796                12
The second dataset that is used in this study was published by the Bearing Data Center of Case Western Reserve University (CWRU) [[
To demonstrate domain shifts and domain adaptation in classification tasks, the learning scenario was chosen to be the detection of the fault type (vs. fault severity). The four groups to be detected are damage at the outer ring (OR), inner ring (IR), ball (B), and no damage (None). In a real-world application this detection should be possible independent of the load. Therefore, the training data was chosen to be the data recorded at 1, 2, and 3 hp load, while the test data is the data recorded at 0 hp load.
As for the ZeMA dataset, we extracted features from the raw data; the result of a PCA performed on the extracted features is presented in Figure 3. In contrast to the ZeMA dataset, it is expected that the most relevant features come from the frequency domain of the vibration sensor. Therefore, a Time Frequency Extractor (TFEx, Section 2.2.1) was used for this use case. Figure 3a shows the PCA plot colored to indicate different loads of the motor and Figure 3b visualizes the same data by coloring according to the damage target for the defined scenario. The healthy state, highlighted with an ellipse in both figures, shows a shift of the data for the motor at zero hp, which can cause difficulties for a model trained only on the other load conditions (
Figure 3 Features from the CWRU dataset after PCA. Visualizing the features based on the motor loads (a). Visualizing the features based on the damage types (b).
Various data-driven models have been applied in condition monitoring and predictive maintenance, including linear discriminant analysis (LDA) [[
Conventional ML methods have been used for a long time [[
We can formulate the conventional ML methods in the form of a pipeline that consists of feature extraction (FE), feature selection (FS), and classification (FESC) or regression (FESR). Depending on the model and input dimensions, it is also possible to apply a classifier/regressor directly to the raw data, but in general FE methods are needed to reduce the dimensionality of the data. FE methods are usually necessary for condition monitoring applications because the raw data can be high-dimensional inputs [[
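As a minimal sketch of such a FESC/R pipeline (here the regression variant, FESR), assuming synthetic raw data and simple stand-ins for each stage — the actual toolbox methods differ — the three steps can be written as:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-in data: raw "signals" whose variance encodes the target
targets = rng.uniform(0, 1, 150)
raw = rng.normal(0, 1 + targets[:, None], (150, 1000))  # high-dimensional raw data

# 1) Feature extraction: reduce each signal to a handful of statistics
feats = np.column_stack([raw.mean(1), raw.var(1), raw.min(1), raw.max(1)])

# 2) Feature selection: rank by absolute Pearson correlation with the target
corr = [abs(np.corrcoef(f, targets)[0, 1]) for f in feats.T]
keep = np.argsort(corr)[::-1][:2]  # keep the 2 best-correlated features

# 3) Regression: ordinary least squares on the selected features
A = np.column_stack([feats[:, keep], np.ones(len(targets))])
coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
rmse = np.sqrt(np.mean((A @ coef - targets) ** 2))
```

The point of the pipeline is visible even in this toy: the 1000-dimensional raw input is reduced to a few informative features before a simple model is fitted.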
In this study an open-source MATLAB toolbox [[
Here, the focus is on the characteristics of these methods in OOD problems, and the goal is to measure the robustness of the models in an OOD scenario. The toolbox is used to search for the best methods and HPs for both datasets; from these results, the following methods were selected. The first FE function is called StatMom [[
Table 2 Features used in TFEx and StatMom.
TFEx (time and frequency domain)   StatMom (time domain)
RMS                                Mean
Variance                           Variance
Skewness                           Skewness
Kurtosis                           Kurtosis
Position of maximum                Linear slope
Maximum                            –
Peak to RMS ratio                  –
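A possible implementation of StatMom-style time-domain features is sketched below; the exact definitions used in the toolbox may differ, and the input signal here is synthetic:

```python
import numpy as np

def statmom(x):
    """StatMom-style time-domain features (cf. Table 2): mean, variance,
    skewness, kurtosis, and linear slope of one raw signal `x`."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / sigma                       # standardized signal
    slope = np.polyfit(np.arange(x.size), x, 1)[0]  # linear trend over time
    return np.array([mu, x.var(), (z**3).mean(), (z**4).mean(), slope])

# Synthetic example signal: a sine with a slight upward drift
sig = np.sin(np.linspace(0, 10, 1000)) + 0.01 * np.arange(1000)
feats = statmom(sig)
```

Each raw signal is thereby reduced to a five-dimensional feature vector, independent of its original length.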
As FS we used two methods, namely Relieff [[
ANNs with three or more layers are called deep neural networks (DNNs); therefore, many modern network architectures are classified as deep learning methods. Over the past decade deep learning algorithms have been used in various applications and achieved outstanding results [[
Designing and training DNNs requires tuning many hyper-parameters (HPs). Hyper-parameters in ANNs can be categorized into two groups: the first contains architecture HPs and the second training HPs. Architecture HPs are parameters that specify the structure of a network, i. e., number of layers, filter size, number of filters in a layer, number of neurons in a layer, and of course the type of a layer. Training HPs specify the training process for a network when the architecture is fixed. Initial learning rate, mini-batch size, and number of epochs are examples of training HPs. The process of choosing the best HPs is generally called HP optimization and, more specifically for architecture HPs, Neural Architecture Search (NAS) [[
Table 3 List of HPs for the CNN, including the search ranges. An iterative HP optimization approach was used; the ranges for the initial and final trials are reported.
HP                                    Initial trial                  Final trial
Initial learning rate (log scale)     10^−4–10^−2                    0.002
Kernel size                           2–10                           3–5
Depth                                 3–10 (Conv blocks)             5–10
# of neurons, fully connected layer   1–1000                         1–100
# of filters                          Fixed, relative to the depth   Fixed, relative to the depth
1st convolutional layer filter size   10–100                         10–35
Batch size                            32                             32
Table 4 List of HPs for the WaveNet-based network, including the search ranges. An iterative HP optimization approach was used; the ranges for the initial and final trials are reported.
HP                                  Initial trial        Final trial
Initial learning rate (log scale)   10^−4–10^−1          10^−3–10^−2
Kernel size                         2–10                 3–6
Depth                               3–10 (Conv blocks)   3–10 (WaveNet blocks)
# of filters                        8–100                40–80
1st conv layer filter size          20–100               20–50
Although NAS has shown superior results, outperforming human-designed networks [[
We used evolutionary parametric architectures together with Bayesian optimization [[
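The iterative narrowing of the search ranges (cf. Tables 3 and 4) can be illustrated with a toy random-search sketch. This is not the Bayesian optimizer used in the study, and the objective function below is merely a stand-in for training a network and returning its validation loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def val_loss(lr, depth):
    # Toy stand-in for "train a network with these HPs, return validation
    # loss"; its optimum (lr = 1e-3, depth = 7) is chosen arbitrarily.
    return (np.log10(lr) + 3) ** 2 + (depth - 7) ** 2 / 10

# Initial (wide) ranges, loosely modeled on the "initial trial" column
lo_lr, hi_lr, lo_d, hi_d = 1e-4, 1e-2, 3, 10
for trial in range(2):  # initial trial, then one narrowed trial
    lrs = 10 ** rng.uniform(np.log10(lo_lr), np.log10(hi_lr), 30)
    depths = rng.integers(lo_d, hi_d + 1, 30)
    losses = np.array([val_loss(l, d) for l, d in zip(lrs, depths)])
    best = np.argsort(losses)[:5]  # keep the 5 best configurations
    # Narrow the ranges around the best configurations for the next trial
    lo_lr, hi_lr = lrs[best].min(), lrs[best].max()
    lo_d, hi_d = depths[best].min(), depths[best].max()

best_lr = lrs[np.argmin(losses)]
```

After the first trial, the search concentrates on the region that produced the lowest validation losses, mirroring how the "final trial" ranges in the tables were obtained.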
Table 5 List of fixed HPs during experiments.
HP                     CNN     WaveNet-based
Batch size             32      32
L2-regularization      0.001   0.001
Learn rate drop rate   0.9     0.9
Maximum epochs         100     10
Optimizer              ADAM    ADAM
Two types of CNNs are used in this study: conventional CNN with a single forward path and a WaveNet-style [[
Table 6 Domain adaptation compared with similar approaches.
                        Source and target tasks   Source and target domains (joint distribution)   Access to target domain
Supervised Learning     Same                      Same                                             –
Transfer Learning       Same/Different            Same/Different                                   Yes
Domain Adaptation       Same                      Different                                        Yes, unlabeled, or limited labels
Domain Generalization   Same                      Different                                        No
As mentioned before, many ML approaches suffer from a degradation of the performance in real world scenarios due to a shift between training and test data [[
Note that domain adaptation in ML is equivalent to the calibration and adjustment of conventional measurement systems. Both for ML methods and conventional measurements, the deviation between the system output and a known target in a few calibration measurements is used to adjust the output accordingly. This is typically done after a change in the environment (domain change) of the sensor system. Because both application examples shown in this paper can be interpreted as domain adaptation tasks, the rest of this paper will focus on domain adaptation.
In this section the results of evaluations for FESR/FESC and DNN models are reported side by side to allow easier comparison.
Although the target and other variables in this dataset are discrete numbers (due to restrictions concerning DoE), they represent continuous variables, and a model should generalize over the complete ranges. The published dataset [[
Figure 4 Prediction results for the ZeMA dataset; lines show a linear function fitted to the training (red points) and test (blue points) predictions. To obtain a better visual representation, a jitter plot is used. (a) Results from the trained FESR stack. (b) Results from a trained CNN selected based on the validation loss.
In the earlier sections we illustrated the domain shift in the ZeMA dataset at the feature level. In this section we show the effect of this phenomenon when we train a model under this condition. The results are from two families of algorithms, FESR and deep learning models.
Figure 5 The output of the NAS algorithm for the ZeMA dataset (a). A convolutional block in this network consists of a convolution layer, a batch normalization layer and a ReLU layer (b).
Starting with the FESR model, we trained a stack of selected methods for the defined task as described in Section 2.1.1. For the selected stack the FE method is StatMom, the FS method is Pearson correlation, and finally PLSR is the last method of the stack. The results of the predictions for the training and test data are plotted in Figure 4a. As there are just four discrete values in the targets, a scatter plot with jittering is used to provide a better view of overlapping data points – otherwise all samples would occur in four vertical lines and would be less distinguishable. Although the slope of the fitted linear line is similar for training and test data, there is a clear shift between them. The change in temperature causes an offset error of approx. 2 %. This is equivalent to a conventional sensor system that suffers from a small cross-sensitivity to temperature. The root mean square error (RMSE) increases from 1.53 % (validation data) to 2.45 % (test data). The reason for this deviation is the shift of the distributions visualized in Figure 2; as the algorithms are not aware of the distribution of the test data, the shifts are not compensated. In the following we compare the results of a trained deep network for the same task.
Figure 6 Final trial of the NAS algorithm. In this plot the validation data are a randomly selected 20 % of the training set. The test data is from a different distribution, i. e., a different operating temperature. Each point is a trained network, (a) ZeMA use case, (b) CWRU use case.
Alternatively, we searched for a DNN architecture to fulfill the same task. The selected DNN is a 9-layer CNN as the outcome of the NAS algorithm with the architecture and parameters as reported in Figure 5 and Table 7, respectively, with the HPs ranges for the first and last trials of the search algorithm given in Table 3. The final ranges for the parameters are values that led to the best networks (with lowest validation losses) in earlier trials. Figure 6 shows the final trial of the NAS progress, each point in the plots is a trained model with the color representing the iteration number of the model from blue to yellow. Since the objective function of this process is the validation loss, the architecture corresponding to the lowest value was selected as the final model. However, the test RMSE of the resulting model is not as low as the validation RMSE, with validation and test errors of 1.15 % and 9.75 %, respectively. To explain why the trained network generalized so poorly on the test data, the predictions of the network for both training and test data are visualized in Figure 4b, also allowing direct comparison to the FESR model.
Table 7 Summary of parameters of the selected CNN after performing the NAS.
Layers         Filter Size (H × W)   Number of filters   Stride
Conv Block 1   1 × 20                8                   1 × 3
Conv Block 2   1 × 4                 8                   1 × 2
Conv Block 3   1 × 4                 16                  1 × 2
Conv Block 4   1 × 4                 24                  1 × 2
Conv Block 5   1 × 4                 32                  1 × 2
Conv Block 6   1 × 4                 40                  1 × 2
Conv Block 7   1 × 4                 48                  1 × 2
Conv Block 8   1 × 4                 56                  1 × 2
Conv Block 9   1 × 4                 64                  1 × 2
Fully Conn 1   81                    –                   –
Dropout 50 %   –                     –                   –
Fully Conn 2   1                     –                   –
The deviation between the features of the source and target domains leads to a shift in the final predictions. While the selected network performs accurately on the validation data, which are drawn from the training distribution, it has difficulty generalizing to the test data. As is evident in Figure 4b, the test data are divided into two groups, with one having only a slight shift from the training data but the second being significantly shifted away, leading to approx. 10 % error in the predictions. These two groups are also visible at the feature level in Figure 2a, where the test data consist of two separate groups. To allow a better visual representation of this problem, the prediction results for the test data are plotted explicitly in Figure 7. The slope of the fitted line is almost identical for both groups, but there is a clear offset between them. Note that this problem would not be visible if a simple random choice of test and training data had been used instead. Therefore, the validation scenario must be designed carefully to ensure that it covers cross-domain situations.
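The importance of this validation design can be demonstrated with a toy experiment (all data synthetic; the domain-dependent offset loosely mimics the cooler-induced shift): a random split reports a much lower error than a split that holds out a whole domain.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: one feature correlated with the target plus a
# domain-dependent offset (e.g. three "cooler states")
domains = np.repeat([0, 1, 2], 100)
target = rng.uniform(72, 100, 300)  # e.g. valve characteristic in %
X = (target + 10 * domains + rng.normal(0, 1, 300))[:, None]

def rmse_of_linear_fit(train, test):
    # Fit a linear model on the training indices, evaluate on the test indices
    a, b = np.polyfit(X[train, 0], target[train], 1)
    return np.sqrt(np.mean((a * X[test, 0] + b - target[test]) ** 2))

# Random split: test observations come from the same domains as the training data
idx = rng.permutation(300)
rmse_random = rmse_of_linear_fit(idx[:200], idx[200:])

# Leave-one-domain-out: test on a domain that was never seen during training
rmse_loio = rmse_of_linear_fit(np.where(domains != 2)[0], np.where(domains == 2)[0])
```

With this synthetic offset, the random split reports a considerably lower RMSE than the leave-one-domain-out split, mirroring the over-optimistic validation results discussed above.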
Figure 7 Predictions on the test dataset by the CNN. The "test group 1" are observations with a low error similar to the training data, while the "test group 2" are data with a significant shift regarding the target and thus high error.
Figure 8 The FESR model predictions after recalibration (a). The CNN predictions after recalibration (b).
As shown in the last section, shifts in the dataset can significantly degrade the performance of a trained model on test data, especially if these represent a different domain. To reduce this problem and improve the results, calibration and adjustment are required. Calibration is performed using the test data of a single class (here: observations with 100 % performance) to simulate the real-world application of the previously trained model to a new machine that is working at 100 % but in an environment with a different temperature. As the simplest form of adjustment, the measured offset is removed in post-processing. Figure 8 shows the results after recalibration for both tested models; quantitative results are reported in Table 8. Recalibration for the ANN model is done only for the second test group (in Figure 7) that had a dominant shift with regard to the training data. While the results for both models improve with domain adaptation, FESR clearly yields the superior result with a test RMSE of 1.58, which is almost as low as the validation RMSE, while the RMSE of the CNN, although reduced by a factor of 3, is still almost twice as high at 3.34.
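A minimal sketch of this offset-based calibration and adjustment (all numbers synthetic; the 2.1 % offset below is illustrative, not the measured value from the study):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical predictions of an already-trained model on the new domain
# for the calibration class with known target (valve at 100 %)
y_true_cal = np.full(20, 100.0)
y_pred_cal = y_true_cal - 2.1 + rng.normal(0, 0.3, 20)  # systematic offset + noise

# Calibration: measure the mean deviation on the known-target observations
offset = (y_pred_cal - y_true_cal).mean()

# Adjustment: remove the measured offset from all predictions on the new domain
y_pred_test = np.array([70.1, 78.0, 87.9, 98.0])
y_adjusted = y_pred_test - offset
```

This is the direct analogue of a one-point recalibration of a conventional sensor: a few measurements with known target determine the offset, which is then subtracted from all subsequent outputs.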
Table 8 Error rates for ZeMA dataset before and after recalibration.
Model   Validation RMSE   Test RMSE   Test RMSE after recalibration
FESR    1.53              2.45        1.58
CNN     1.15              9.74        3.34
Figure 9 LDA projection of the features in the CWRU use case (a). PCA plot of the embedded features from the network in the last convolution layer; the graph shows the first and second principal components (PCs) of the features (b).
Figure 10 The WaveNet-based model with the lowest validation loss in the NAS algorithm (a). The WaveNet block (b) and a convolution block (c) that are used in the architecture.
In the same way as for the HS use case, we first chose a stack of FESC methods that works best for this task. As mentioned above, the FE method is TFEx (cf. Section 2.2.1), with Relieff used for FS and finally LDA and the Mahalanobis distance for classification. The accuracy on the test data is 99 %, which is exceptionally good. To check whether the model compensated the shift for the test set, we visualize the projected features after the LDA. Figure 9a shows the results of the projection, which show a small shift between training and test data for the damaged samples, but a significant shift for the healthy state (damage type "None"). However, the projections of those observations are still sufficiently far away from the other groups to be classified correctly. Also, it should be noted that the shifts are not in the same direction for all target groups, due at least in part to the fact that the targets are categorical and can therefore not be sorted in a logical order.
As mentioned above, we expect relevant features also from the frequency domain for this use case; therefore a network architecture that previously showed superior results for raw audio and vibration signals, WaveNet, is used. An HP search for the WaveNet-based network in accordance with Table 4 was conducted and resulted in the network shown in Figure 10 with HPs as described in Table 9. Similar to the earlier use case the validation accuracy of many networks is 100 %, but selecting a network that generalizes well to the test set is challenging and still an open question [[
Table 9 Summary of parameters used for the WaveNet-based network.
Layers          Filter Size (H × W)   Number of filters   Stride   Dilation Factor
Conv Block 1    1 × 50                80                  1 × 3    1 × 1
WaveNet Block   1 × 5                 80                  1 × 1    5^(BlockNumber−1)
Conv Block 2    1 × 4                 80                  1 × 2    1 × 1
Pooling 1       1 × 4                 –                   1 × 4    –
Conv Block 3    1 × 8                 80                  1 × 1    1 × 1
Conv Block 4    1 × 8                 80                  1 × 1    1 × 1
Pooling 2       1 × 8                 –                   1 × 8    –
Final Conv      1 × 1                 4                   1 × 1    1 × 1
Figure 11 LDA projection of the features in the CWRU use case after recalibration (a). Embedded features from the network in the last convolution layer after recalibration; the graph shows the first and second PCs of the features (b).
Although the test accuracy of the trained FESC stack is almost perfect (98.8 ± 0.8 %), we still apply calibration and adjustment to compare the results. For this use case the shifts of the target groups differ for each individual class; therefore, using a single class to calibrate the test set is not sufficient. This is evident in Figure 9a: if we move the test data for the healthy state to the mean value of the training set and then apply the same shift to the other classes, the observed shifts for those classes increase considerably. One solution is to apply standardization using a small portion of the test set from all classes. Thus, 20 % of the test data from each class was used for this form of calibration and adjustment; the labels of the recalibration data are not needed. Figure 11 shows the results after standardizing the training and test data for both the FESC stack and the ANN model. Quantitative results are presented in Table 10; because of the stochastic evaluation procedure, the mean and standard deviation of 10 different runs are reported. Similar to the HS use case, a significant improvement is achieved for both ML approaches with the proposed domain adaptation; however, the performance of the FESC approach under the domain shift is significantly better than that of the deep network. Furthermore, it proved to be more robust to the domain shift even before domain adaptation, i. e., it might be considered an example of domain generalization.
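The standardization-based adjustment can be sketched as follows (synthetic feature matrices; in the procedure described above, 20 % of the unlabeled test data from each class is used — here a simple subset stands in for that):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic features: the target domain is shifted and rescaled
# with respect to the source domain
X_train = rng.normal(0.0, 1.0, (300, 8))
X_test = 1.5 * rng.normal(0.0, 1.0, (100, 8)) + 3.0

# Standardize each domain with its own statistics; only a small,
# unlabeled portion of the test data (here 20 observations) is needed
mu_s, sd_s = X_train.mean(axis=0), X_train.std(axis=0)
sub = X_test[:20]
mu_t, sd_t = sub.mean(axis=0), sub.std(axis=0)

X_train_std = (X_train - mu_s) / sd_s
X_test_std = (X_test - mu_t) / sd_t  # now roughly matches the training statistics
```

Because each domain is normalized with its own mean and standard deviation, no labels are required; the trained classifier then sees test features on the same scale as the training features.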
Table 10 Accuracy of the models for the CWRU dataset, before and after recalibration.
Model           Validation Accuracy %   Test Accuracy %   Test Accuracy % after recalibration
FESC            100                     98.8              99.7 ± 0.3
WaveNet-style   100                     81                92.5 ± 0.5
In this paper DNNs were compared with conventional methods based on feature extraction and selection in scenarios with distribution shifts caused by changing ambient or experimental conditions. By visualizing the data at different levels, it was shown how shifts in the raw data can propagate through a model and cause shifts in the predictions. As shifts in the data distribution are inevitable in many real-life scenarios, this issue needs to be considered when building a comprehensive ML model, i. e., in model selection, the validation scenario, and the adaptation of the training process. In the presented scenarios the conventional FESC/R approaches show better results than the ANN solutions. Although finding a DNN that correctly predicts the training data is not difficult using NAS algorithms, selecting a network that generalizes to the test data is highly challenging in a cross-domain situation. We also presented two simple domain adaptation techniques to improve the results of trained models. This showed that domain adaptation can be formulated as recalibration, especially for regression use cases, achieving good results for both ML approaches, but again with significant advantages for the conventional approach. For classification tasks this offset-based recalibration is not as straightforward due to the categorical nature of the target data and did not show significant improvement. Again, the conventional approach proved to be more robust against distribution shifts and achieved better performance after recalibration by normalization. Moreover, in the CWRU use case the FESC method achieves near-perfect accuracy in a cross-domain scenario even before recalibration and can thus be considered an example of domain generalization.
For future work, further investigation is suggested into why FESC/R performs better than DNNs in cross-domain scenarios, which could help to improve ANN architectures and make them more robust for real-world applications. One could assume that this results from the implicit extraction of useful information from the data during the feature extraction and selection steps, reducing the task complexity and making the results more stable with respect to possible changes in the input data. On the other hand, the classical approach can be boosted by explicitly introducing non-linearities based on polynomial expansion of the features in combination with linear classification/regression algorithms as recently suggested [[
By Payman Goodarzi; Andreas Schütze and Tizian Schneider
Payman Goodarzi studied Embedded Systems at Saarland University and received his Master of Science degree in March 2020 with a thesis on the interpretability of neural networks. Since that time, he has been working at the Lab for Measurement Technology (LMT) of Saarland University and at the Centre for Mechatronics and Automation Technology (ZeMA) as a scientific researcher. His research interests include ML and deep learning for condition monitoring of technical systems.
Andreas Schütze received his diploma in physics from RWTH Aachen in 1990 and his doctorate in Applied Physics from Justus-Liebig-Universität in Gießen in 1994 with a thesis on microsensors and sensor systems for the detection of reducing and oxidizing gases. From 1994 until 1998 he worked for VDI/VDE-IT, Teltow, Germany, mainly in the fields of microsystems technology. From 1998 until 2000 he was professor for Sensors and Microsystem Technology at the University of Applied Sciences in Krefeld, Germany. Since April 2000 he is professor for Measurement Technology in the Department Systems Engineering at Saarland University, Saarbrücken, Germany and head of the Laboratory for Measurement Technology (LMT). His research interests include smart gas sensor systems as well as data engineering methods for industrial applications.
Tizian Schneider studied Microtechnologies and Nanostructures at Saarland University and received his Master of Science degree in January 2016. Since that time, he has been working at the Lab for Measurement Technology (LMT) of Saarland University and at the Centre for Mechatronics and Automation Technology (ZeMA) leading the research group Data Engineering & Smart Sensors. His research interests include ML methods for condition monitoring of technical systems, automatic ML model building and interpretable AI.