Machine learning for prognostics and health management of industrial mechanical systems and equipment: A systematic literature review

In the last decade, the adoption of technological tools in the manufacturing industry, such as the Internet of Things (IoT) and Machine Learning (ML), has led to the advent of Industry 4.0 (I4.0). In this scenario, intelligent devices can generate large volumes of data about industrial machinery and equipment that can be used to make maintenance more efficient. Prognostics and Health Management (PHM) is an emerging maintenance strategy that uses Condition Monitoring of systems, through IoT sensors installed on machinery, to diagnose their faults or estimate their Remaining Useful Life (RUL). This study aims to conduct a Systematic Literature Review (SLR) on the use of ML techniques in the field of PHM of industrial mechanical systems and equipment. Fifty studies were found eligible for the SLR. The diagnostics and prognostics approaches and the types of ML algorithm used in these 50 papers have been analyzed together with the Key Performance Indicators (KPIs) used for their validation. The analyses show that Shallow Learning and Deep Learning (DL) algorithms are the most applied ones, while KPIs are used differently according to the type of task (classification or regression). Moreover, the results highlight that many authors still use artificial datasets to test their algorithms, instead of datasets based on real data retrieved from their own components. For the latter type of dataset, this paper also introduces a schematic framework to standardize the step-by-step diagnostics and prognostics process carried out by the authors.


Introduction
Terms like "Internet of Things" (IoT), "Cyber-Physical Systems" (CPS), "Internet of Services" (IoS), "Digital Twins" (DT), and "Machine Learning" (ML) have laid the foundations for the so-called Industry 4.0 (I4.0), which has prompted many companies to completely renew the concept of maintenance, improving productivity, preventing downtimes and reducing costs. 1,2 Among the different maintenance strategies, Prognostics and Health Management (PHM) represents one of the most innovative and fits well into the new I4.0 scenario; indeed, it is based on systems' Condition Monitoring (CM) through IoT sensors installed on machinery. 3 PHM fulfills the two important tasks of diagnosis and prognosis, which define the health state of the systems and avoid unexpected failures by preventing damage. 4 Different parameters can be monitored in PHM according to the type of equipment, such as temperature, vibration, pressure, acoustic emission, force, tension, and others. 5 Industrial Structures, Systems, or Components (SSCs) are considered to be in a normal state if these parameters remain above a predetermined threshold. 6 Indeed, the evolution in time of these parameters can be used to monitor any deviation from normal operating conditions, which can help to determine how long the equipment will remain in good condition before it falls into an unhealthy state. Therefore, PHM is mainly focused on both Fault Diagnosis (FD), when a failure state is present and the source of the anomaly must be investigated, and Fault Prognosis (FP), when the future degradation must be predicted until complete failure occurs; 7 in the latter case, the Remaining Useful Life (RUL) of SSCs is often estimated. RUL is defined as the length of time from the current time to the end of the useful life, that is, when the system condition reaches the failure threshold. 8
The forecasting window plays a crucial role in prognosis because the objective is to provide an estimate of the future time-step when a certain event will occur. 9 In recent years, several methods to evaluate RUL or FD have been proposed: model-based, data-driven, or a mix of both. Model-based approaches rely on knowledge of the inherent system failure mechanism to build a mathematical degradation model that describes the physical nature of the fault; 10 on the other hand, data-driven techniques rely on collected data to extract knowledge about the health status of the monitored equipment. This task is particularly suitable for ML algorithms. 11 These algorithms range from conventional Shallow Learning (SL) techniques, such as Artificial Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DTR), and Random Forest (RF), to more recent techniques, such as Deep Learning (DL) algorithms. 12 Machine learning is emerging as one of the major approaches for PHM and RUL estimation. It is mainly used for solving two types of tasks, namely "Classification" and "Regression". Classification tasks have a finite number of output classes, while regression tasks have real-valued outputs that can take infinitely many values. By its nature, FD is a classification problem. RUL prediction, instead, is usually a regression problem, even if there are rare cases in which RUL is treated as a classification problem. 13 Regardless of the type of algorithm used or the type of task faced, an important step in using ML in PHM is being able to measure the performance of the algorithm. Therefore, it is necessary to define Key Performance Indicators (KPIs) to determine the accuracy of an algorithm and the associated methodology.
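As an illustration of the RUL definition given above (the time from the current instant to the instant at which the system condition reaches the failure threshold), the following minimal sketch labels a run-to-failure record. It assumes a hypothetical scalar health indicator that grows as the component degrades; the function name, the sample values, and the threshold are invented for this example and do not come from any of the reviewed papers.

```python
def remaining_useful_life(times, indicator, failure_threshold, current_index):
    """Return the time from times[current_index] to the first threshold crossing.

    Assumption: the indicator increases with degradation, so failure is the
    first sample at or above failure_threshold. Returns None if the record
    never reaches the threshold after the current instant.
    """
    for t, value in zip(times[current_index:], indicator[current_index:]):
        if value >= failure_threshold:
            return t - times[current_index]
    return None


# Hypothetical run-to-failure record (time in hours, dimensionless indicator).
times = [0, 10, 20, 30, 40, 50]
indicator = [0.10, 0.20, 0.35, 0.60, 0.90, 1.20]

rul_at_20h = remaining_useful_life(times, indicator, failure_threshold=1.0, current_index=2)
rul_at_50h = remaining_useful_life(times, indicator, failure_threshold=1.0, current_index=5)
print(rul_at_20h, rul_at_50h)
```

Labels computed this way are real-valued, which is why RUL prediction is usually posed as a regression task; binning them into a few classes would turn it into the rarer classification formulation.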
The study aims to conduct a Systematic Literature Review (SLR) on the use of ML techniques in the field of PHM of industrial mechanical systems and critical equipment. To the best of the authors' knowledge, the problem presented in this paper has not been addressed previously. Thus, to fill this gap, the study investigates diagnostics and prognostics applied to industrial SSCs, the kinds of ML algorithms used, and the Key Performance Indicators (KPIs) used to validate them.
The rest of this paper is outlined as follows: the next section presents the methodology followed to conduct the research that led to the identification of the selected studies; then, the results are analyzed and discussed by carrying out a bibliometric analysis and answering the Research Questions (RQs); finally, the last section highlights the conclusions and future work.

Research methodology
To obtain an overview of the use of ML techniques in the field of PHM of industrial SSCs, an SLR 14 similar to the ProKnow-C methodology 15 was performed to answer the following Research Questions (RQs):
-RQ 1. What are the most used ML algorithms for PHM's diagnostics and prognostics of industrial SSCs?
-RQ 2. What are the main performance metrics of ML algorithms adopted in PHM of industrial SSCs?
To answer the aforementioned questions, a literature search was performed on the Scopus database (www.scopus.com). [18][19] The search string was run on January 10, 2023. To restrict the search to the desired themes only, several combinations of keywords were used to include all possible papers related to the concepts of:
-PHM (i.e., "PdM" OR "predictive maintenance" OR "data-driven PdM" OR "prognostic" OR "condition-based maintenance");
-diagnostics and prognostics (i.e., "fault" OR "RUL");
-ML (i.e., DL).
Each of these keywords was searched in the title, abstract, or keywords (TITLE-ABS-KEY) of the documents, that is, at least one keyword from each of the three above-mentioned batches must be present in the title, the abstract, or the document keywords. This query provided a first set of 483 results; then, some Inclusion Criteria (IC) were considered:
-IC 1. Only papers in the final publication stage;
-IC 2. Only English language papers;
-IC 3. Only recent papers, published between 2008 and 2023.

International Journal of Engineering Business Management

This first filtering returned 418 papers; to further limit this number of documents, a fourth IC was added to the others: -IC 4.Only journal papers (other kinds of documents, such as books or conference papers were not considered).
At this point, three further steps, described below, were conducted to identify the final set of papers. First, 31 documents were removed by simply reading the paper and journal titles, because they were not related to the industrial mechanical systems field but to the medical, railway, robotics, or chemical fields. Second, the remaining 223 abstracts were analyzed, discarding 148 documents because either they did not consider industrial mechanical applications (but aeronautical, aerospace, or chemical ones), or they did not consider any application to validate their ML algorithms; the remaining 75 documents went to the next analysis step, although 24 of these 75 needed a more in-depth analysis because, by simply reading the abstracts, it was impossible to determine either whether the authors considered some kind of application or whether the applications fell within the industrial field of interest. Finally, a full-paper analysis was conducted, which allowed excluding 25 documents because they did not consider ML algorithms (but statistical techniques) or industrial mechanical applications. Therefore, 50 papers out of the 254 were considered eligible for the following analysis.
An overview of the whole search process is provided in Figure 1.
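The screening counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; the stage labels are paraphrased, and reading 254 as the number of documents left after IC 4 is an inference from the totals (254 - 31 = 223), not a statement made explicitly in the text.

```python
# Counts quoted in the methodology section.
initial_hits = 483          # first result set returned by the query
after_ic1_to_ic3 = 418      # after IC 1, IC 2 and IC 3
after_ic4 = 254             # inferred: total quoted at the end of the screening

removed_by_title = 31
removed_by_abstract = 148
removed_by_full_text = 25

after_title = after_ic4 - removed_by_title
after_abstract = after_title - removed_by_abstract
eligible = after_abstract - removed_by_full_text

funnel = [
    ("Scopus query", initial_hits),
    ("IC 1-3", after_ic1_to_ic3),
    ("IC 4", after_ic4),
    ("title screening", after_title),
    ("abstract screening", after_abstract),
    ("full-text analysis", eligible),
]
for stage, count in funnel:
    print(f"{stage:<20}{count:>5}")
```

The intermediate values reproduce the 223 abstracts and 75 full papers mentioned above, and the final value matches the 50 eligible studies.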

Results and discussion
In this section, the results are shown and discussed according to the previously defined RQs. First, in section 3.1, a bibliometric analysis 20 is carried out to highlight the trends of the analyzed publications over the years. Next, RQ 1 and RQ 2 are answered in section 3.2 and section 3.3, respectively, where the ML algorithms and KPIs used in the 50 analyzed studies are examined; finally, in section 3.4, a schematic framework is developed to standardize the diagnostics and prognostics process carried out by the authors who used their own datasets rather than the common publicly available ones.

Bibliometric analysis
Figure 2 shows how the 50 selected papers are distributed over the years, including the number of citations received per year. They cover an 8-year period, from 2016 to 2023, although IC 3, defined in the previous section, considers eligible papers starting from 2008. Only 20 papers of the first set of 483 results belong to the years 2008-2015, and none of them concerns the industrial mechanical field; they concern the medical, railway, chemical, or aeronautical fields instead. For this reason, they were excluded from the final analysis. In Figure 2, it is possible to note that the number of papers increased in the last few years, reaching a peak of 14 publications in 2021. This increasing number of studies over the years is not surprising, considering that the term "Industry 4.0" was used for the first time in Germany in 2011, precisely during the Hanover Fair, where the Communication Promoters Group of the Industry-Science Research Alliance (FU) announced a project for the development of the German industrial manufacturing sector, the "Zukunftsprojekt Industrie 4.0"; 21 since then, the German model, combined with the improvements in the interconnectivity of IoT and robotics devices brought by Artificial Intelligence (AI) technologies, has inspired numerous researchers to continue working in the ML field to improve productivity and reduce the costs related to industrial maintenance. 22 Concerning the number of citations per year, Figure 2 shows that the trend is not stable, with a peak of 746 citations in 2019, an average of 373.8, and 0 citations in 2023, because of the narrow time window available to receive citations in that year, given that the literature search was run on January 10, 2023.

Polverino et al.
Table 1 shows the most relevant journals of the analyzed papers (journals with only one paper each were put together in the last row named "others").
The 50 analyzed papers present 6 different types of SSCs (Figure 3) and 9 different types of datasets (Figure 4): 8 are online public datasets, and 1 is the "own-dataset" type, that is, datasets created specifically for the task addressed and the industrial application of the authors. Note that the percentage values in Figure 3 and in Figure 4 sum to more than 100% because the authors often examined more than one type of SSC and/or dataset. Regarding the mechanical systems and components analyzed in the papers, bearings appear in 74% of the analyzed studies, followed by gears at 16%, milling machines' cutting tools at 10%, and a pump's impeller, a ball screw, and a hot strip mill's roller at 2% each. These trends can be explained by noting that most of the problems arising in rotating machinery are caused by faulty gears and bearings. 23 As components between the stationary and the rotating parts of industrial machinery, bearings represent an essential part of it; in fact, they cause more than 50% of induction motors' failures, mainly because of overheating, excessive axial and radial loads, and electrical stress such as the presence of bearing currents. 24
As a consequence of the predominance of bearings and gears among the SSCs analyzed by the authors, four popular public datasets turned out to be the most used in the analyzed papers: for bearings, the IEEE PHM 2012 Challenge dataset (36%) and the XJTU-SY and CWRU bearing datasets (18%), and, for gears, the PHM 2009 challenge dataset (8%). The remaining four datasets consist of two datasets for gearboxes (University of Alberta gearbox and 2021 Tsinghua University dataset), one dataset for milling machines' cutting tools (IMS-Foxconn dataset), and one dataset for bearings (NASA bearing dataset). Moreover, from Figure 4, it is possible to note that 38% of the analyzed papers present datasets created for the specific problems investigated by the authors; this theme is examined in depth in section 3.4.

Machine learning algorithms for PHM of SSCs
This section aims to respond to RQ 1, that is, What are the most used ML algorithms for PHM's diagnostics and prognostics of industrial SSCs?
The constant increase in data availability due to intelligent sensors installed on SSCs, in addition to the technological progress in computer hardware and software and a large number of cross-platform tools and libraries, such as MATLAB, Python, R, and scikit-learn, have led to the rapid development of multiple ML techniques to better address the issue of PHM of SSCs. These techniques range from the first classic SL techniques to the more recent DL ones. The word "shallow" comes from the single hidden layer of the first simple neural networks; nowadays, "Shallow Learning" usually refers to all the traditional ML models, that is, those proposed before 2006. 25 Among these, those used in the 50 analyzed studies are: shallow ANN (i.e., neural networks with only one hidden layer of nodes), SVM, DTR, RF, statistical models, and hybrids, that is, combinations of these algorithms. On the other hand, DL models are based on neural networks with multiple hidden layers between the network's input and output. 7 Among these, those used in the 50 analyzed studies are: Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Auto-Encoders (AE), Restricted Boltzmann Machines (RBM), and hybrids, that is, combinations of these algorithms. Furthermore, SL/DL hybrid methods, that is, algorithms in which SL and DL models are combined, are not uncommon.

Table 2 shows all the ML algorithms used in the analyzed papers, clarifying their nature (type of algorithm) and their family (SL, DL, or SL/DL hybrid methods).
Moreover, the frequency of citations of the aforementioned ML algorithms is shown in Figure 5, which makes clear the predominance of DL methods (82%, i.e., 41/50 sample papers) over both SL methods (10%, i.e., 5/50 sample papers) and hybrid SL/DL ones (8%, i.e., 4/50 sample papers). One of the reasons for the wider use of DL, which is supplanting the traditional SL algorithms, is the ability to skip the manual extraction of features from the input data before they are fed into the network, thanks to a nested series of consecutive computations that result in the extraction of a set of complex and highly informative features; moreover, in recent years, an increasing number of empirical results have shown that these models return better diagnostics and prognostics performance than "shallow" methods. The main problem is that, compared with SL models, DL ones require a larger amount of training data (not always available), and the models to build are more complex. 7 The pie charts in Figure 6 show how the ML techniques are distributed among the 50 sample papers, dividing them into SL algorithms (a), DL algorithms (b), and hybrid ones (c). Regarding the SL algorithms, as mentioned above, only 5 of the 50 analyzed papers use SL methods, with a prevalence of RF (30%, i.e., 3 times), followed by DT and SVM (20%, i.e., 2 times each), and ANN, ANFIS (a hybrid between ANN and a statistical model), and the R-S-G statistical model (10%, i.e., 1 time each). RF, DT, SVM, and ANN have been used for prognostics tasks, while ANFIS and R-S-G statistical models for diagnostics tasks. The DL algorithms, used in 41 of the 50 analyzed papers, are distributed as follows: CNN and RNN are the most used (25.6%, i.e., 11 times each), followed by DNN and hybrid ones (16.3%, i.e., 7 times each); the hybrids consist of two models, that is, a combination of CNN and RNN (14%, i.e., 6 times) and a combination of RNN and DNN (2.3%, i.e., 1 time); the least used DL algorithms are RBM (9.3%, i.e., 4 times) and AE (7%, i.e., 3 times). CNN, DNN, RBM, AE, and hybrid ones have been used for both prognostics and diagnostics tasks, while RNN has been used only for prognostics tasks. Finally, there are four DL/SL hybrid algorithms: CNN-RF, RBM-ANN, DNN-DT, and AE-SVM; each of them has been used only for prognostics tasks.
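The percentages quoted for Figure 6 can be reproduced from the occurrence counts given in the text. The short sketch below is only an arithmetic check; it suggests that the shares are computed over the total number of algorithm occurrences (10 for SL, 43 for DL) rather than over the number of papers, since a paper may use more than one algorithm. That reading is an inference from the numbers, not something stated in the text.

```python
# Occurrence counts as quoted in the text for panels (a) and (b) of Figure 6.
sl_counts = {"RF": 3, "DT": 2, "SVM": 2, "ANN": 1, "ANFIS": 1, "R-S-G": 1}
dl_counts = {"CNN": 11, "RNN": 11, "DNN": 7, "Hybrid": 7, "RBM": 4, "AE": 3}


def shares(counts):
    """Percentage share of each algorithm over all occurrences, one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


sl_shares = shares(sl_counts)
dl_shares = shares(dl_counts)

print(sum(sl_counts.values()), sl_shares)
print(sum(dl_counts.values()), dl_shares)
```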

Machine learning KPIs for PHM of SSCs
The efficiency and effectiveness of an ML model can be evaluated using Key Performance Indicators. KPIs are usually divided into 2 groups: (i) KPIs for ML classification tasks, for which the output is divided into positive and negative classes; for instance, considering FD described through a simple binary classification, the negative class stands for "fault" and the positive class stands for "working"; (ii) KPIs for ML regression tasks, for which the output may be any value; for instance, RUL values may range from 0 to 100, where 0 stands for "fault" and the other values stand for "still working." Since different ML tasks produce different outputs (i.e., continuous or discrete), the related KPIs are consequently different too.

KPI: Description
Accuracy (A): It is the ratio between the total number of correctly classified samples and the total number of samples within the test set. It is bounded to [0, 1], where 1 represents predicting all positive and negative samples correctly, and 0 represents predicting none of the positive or negative samples correctly.
Recall: It is the ratio between correctly classified positive samples and all samples assigned to the positive class. It is bounded to [0, 1], where 1 represents perfectly predicting the positive class, and 0 represents incorrect prediction of all positive class samples.
Precision (P): It is the ratio between correctly classified class samples and all samples assigned to that class. "Class" is a variable that can assume both "positive" (C = P) and "negative" (C = N) values. The positive case of the precision (C = P) is called "positive predictive value" (PPV), which is the ratio between correctly classified positive samples and all samples classified as positive, while the negative case of the precision (C = N) is called "negative predictive value" (NPV), which is the ratio between correctly classified negative samples and all samples classified as negative.
Area Under the Receiver Operating Characteristic curve (AUROC): The ROC is a curve plotted between the false positive rate (FPR) on the x-axis and recall on the y-axis. FPR, just like recall, has values in the range [0, 1], but 1 represents incorrect prediction of all negative class samples, and 0 represents perfectly predicting the negative class. The AUROC value changes according to the model, but in the case of simple binary classification, the AUROC is given by equation (6). It is bounded to [0, 1], where 0 means that the model is predicting a negative class as a positive class and vice versa, and 1 means that the model has a perfect capacity to separate the classes.
Confusion matrix: It is an n x n matrix, where "n" is the number of classes to be predicted. In the case of binary classification (n = 2), the confusion matrix looks like equation (7). It is not exactly a performance metric, but it is the starting point from which the other metrics are defined and the results evaluated.
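As a worked example of the classification KPIs described above, the following minimal sketch derives them from the four cells of a binary confusion matrix. It uses the standard textbook definitions, which are assumed here to correspond to the formulas of Table 3 (these did not survive in this copy); the function name and the sample counts are invented for illustration.

```python
def classification_kpis(tp, tn, fp, fn):
    """Standard binary-classification KPIs from confusion-matrix counts.

    Every denominator is assumed to be non-zero, i.e. both classes occur
    and both classes are predicted at least once.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # correctly classified / all samples
        "recall": tp / (tp + fn),        # correct positives / actual positives
        "ppv": tp / (tp + fp),           # precision for the positive class
        "npv": tn / (tn + fn),           # precision for the negative class
        "fpr": fp / (fp + tn),           # false positive rate (x-axis of the ROC)
    }


# Hypothetical test set: 50 "working" (positive) and 50 "fault" (negative) samples.
kpis = classification_kpis(tp=40, tn=45, fp=5, fn=10)
for name, value in kpis.items():
    print(f"{name:<9}{value:.3f}")
```

Keeping the four counts explicit, rather than only the derived ratios, makes it easy to see why the confusion matrix is described as the starting point for all the other metrics.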
To answer RQ 2 (What are the main performance metrics of ML algorithms adopted in PHM of industrial SSCs?), a brief description of the Evaluation Metrics (EM) used in the 50 analyzed papers is first given in Table 3 (KPIs for classification tasks) and Table 4 (KPIs for regression tasks). 26 Moreover, the frequency of citations of the aforementioned EM is shown in the diagram in Figure 7, where the KPIs are divided into classification (a) and regression (b) tasks. Among the 50 analyzed papers, 18 deal with the classification task, while 36 deal with the regression task. [...] 8%, i.e., 8 times), R² and MSE (7.4%, i.e., 5 times each), and CRA (1.5%, i.e., 1 time). Table 5 below summarizes the main characteristics of the selected papers sorted by industrial application type, in terms of article, received citations, objective, industrial application(s), dataset(s), ML technique(s), ML task types, and KPI(s) used.

Cumulative Relative Accuracy (CRA): It is a normalized weighted sum of relative accuracies at specific time instances ($RA_\lambda$). The latter is defined as a measure of the error in RUL prediction, relative to the actual RUL at a specific time index ($RUL_i(\lambda_i)$). $RA_\lambda$ is used as a metric to emphasize that errors closer to the actual failure of a component are more severe. $\lambda_i$ is defined as the normalized time, and it is the ratio between the time-instant ($t_i$) and the time-to-failure of the component ($t_f$); $\lambda$ is bounded to [0, 1], where 0 means that the component is in its maximum state of health, while 1 means that the component has failed. 27,28
Score ($A_i$): This metric was used for the IEEE PHM 2012 prognostic challenge, and it sets asymmetric penalties for late and early predictions. The letter "i" stands for the i-th bearing; in fact, if there is more than one test bearing (as often happens), it is possible to evaluate the average score of the RUL prediction for all testing bearings (A-score). $A_i$ is 1 when the per cent error $Er_i$ is 0; as the per cent error increases, the score decreases. 29
Where: N = cardinality of the dataset; $RUL_i(t)$ = i-th actual RUL value at t-instant; $\widehat{RUL}_i(t)$ = i-th predicted RUL value at t-instant; $\overline{RUL}$ = mean value of the actual RUL samples in the dataset; $RUL_i(\lambda_i)$ = i-th actual RUL value at λ-instant; $\widehat{RUL}_i(\lambda_i)$ = i-th predicted RUL value at λ-instant; $t_i$ = i-th time-instant; $t_f$ = time-to-failure of the component; $w_i$ = i-th weight factor as a function of RUL at all time instants, that is, $w_i(RUL_i)$; $Er_i$ = percentage error of the i-th bearing.
Note: MSE, RMSE, MAE, and CRA are bounded to [0, +∞], while MAPE, R², and $A_i$ are bounded to [0, 1]; since MSE, RMSE, MAE, and MAPE are coefficients that evaluate an error, the lower the value, the greater the accuracy of the forecast. Instead, for R², CRA, and $A_i$, the higher the metric, the better the prediction performance.
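The regression KPIs named in this section can be illustrated with a small sketch. It uses the standard definitions of MSE, RMSE, MAE, MAPE (expressed as a fraction) and R², and the scoring rule published for the IEEE PHM 2012 challenge for the per-bearing score; these are assumed to match the formulas of Table 4, which did not survive in this copy. The function names and the sample RUL values are invented for the example.

```python
import math


def regression_kpis(actual, predicted):
    """Standard error metrics between actual and predicted RUL values.

    Assumes equal-length, non-empty sequences, non-zero actual values
    (needed by MAPE) and at least two distinct actual values (needed by R2).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R2": r2}


def phm2012_score(actual_rul, predicted_rul):
    """Per-bearing score A_i with asymmetric penalties (late is penalised more)."""
    er = 100 * (actual_rul - predicted_rul) / actual_rul   # per cent error Er_i
    if er <= 0:                                            # late prediction
        return math.exp(-math.log(0.5) * (er / 5))
    return math.exp(math.log(0.5) * (er / 20))             # early prediction


# Hypothetical RUL values (e.g. in hours).
actual = [100, 80, 60, 40]
predicted = [90, 85, 60, 50]
kpis = regression_kpis(actual, predicted)
early = phm2012_score(100, 80)    # predicted failure 20% too early
late = phm2012_score(100, 105)    # predicted failure 5% too late
print(kpis, early, late)
```

In this example a 5% late prediction and a 20% early prediction receive the same score, which shows the asymmetry: overestimating the RUL is treated as the more costly mistake.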

PHM framework for the "own-dataset papers"
Figure 8 shows the analyzed papers' diagnostics and prognostics distribution and how regression and classification tasks are allocated to them. It is possible to note that prognostics overtakes its counterpart with 70% of the papers (divided into classification and regression tasks) versus 30% of the papers which address diagnostics (only through classification tasks). However, the prognostics and diagnostics percentages shown in Figure 8 could be misleading, because they are not necessarily related to the real prognostics and diagnostics data of the manufacturing industry; rather, they are related to the complexity of monitoring and analyzing data through IoT devices in industry, which led to the use of pre-existing datasets just to find the best ML algorithms proposed by the authors of the 50 papers. This is the reason why the papers that present datasets created specifically for the task addressed by their authors (own-dataset papers) are further investigated in this section; from Table 5 it is possible to infer that an own-dataset has been used in 19 of the 50 analyzed studies. As a first step, to better understand the real partition between prognostics and diagnostics in the industrial field, Figure 9 shows the own-dataset papers' diagnostics and prognostics distribution related to regression and classification ML tasks. Comparing Figures 8 and 9, it emerges that both the prognostics and the diagnostics trends are confirmed, that is, a clear predominance of both prognostics over diagnostics and regression over classification.
Therefore, Figure 10 below shows a single common PHM framework which describes the step-by-step diagnostics and prognostics process carried out by the authors of the 19 own-dataset papers. It is worth noting that the path is not unique, since some steps could be repeated for the diagnostic and prognostic tasks; for example, although the prognostic step relies on the results of the diagnostic step, it may be necessary to perform steps 2 to 5 again since the purpose of the task has changed. Moreover, step 8 could follow both steps 6 and 7. The aforementioned steps are described as follows:

Data acquisition. The raw data (vibrations, temperatures, pressures, acoustic emissions, etc.) are acquired over time by the sensors installed on the critical components in laboratories' test platforms. The variables analyzed by the sensors change depending on the type of SSC. Vibration seems to be the most analyzed variable for bearings, 31-33,43,52,54,67 followed by temperature, 55 and by oil supply pressure, pressure applied to bearings, and lubrication oil flow; 64 moreover, vibration, cutting force, and acoustic signals are the variables consistently analyzed by the sensors for milling machines' cutting tools; 75-77,79 vibration is the only analyzed variable for gears; 10,70,73 strip temperature, strip thickness, strip width, strip flatness, and roller gap are used to analyze the degradation performance of the hot strip mill's roller; 74 finally, for the ball screw, vibration and position of the screw are used to evaluate its wear state. 30 A singular case concerns the paper, 66 [...]

Time-domain is based on converting raw data into statistical features such as mean, median, standard deviation, variance, root mean square (RMS), skewness, and kurtosis.
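The time-domain features listed above can be computed directly from a window of raw samples. The following is a minimal sketch using common textbook definitions (population variance, non-excess kurtosis); the reviewed papers may use slightly different conventions, and the function name and the sample signal are invented for illustration. The crest factor is added as one example of the ratio-type features that some authors also extract.

```python
import math
import statistics


def time_domain_features(signal):
    """Statistical time-domain features of one window of raw sensor samples.

    Assumes a non-empty, non-constant window, so that the standard
    deviation and the RMS are non-zero.
    """
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal) / n    # population variance
    std = math.sqrt(variance)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    return {
        "mean": mean,
        "median": statistics.median(signal),
        "std": std,
        "variance": variance,
        "rms": rms,
        "skewness": sum((x - mean) ** 3 for x in signal) / (n * std ** 3),
        "kurtosis": sum((x - mean) ** 4 for x in signal) / (n * std ** 4),
        "crest_factor": peak / rms,
    }


# Hypothetical vibration window (arbitrary units).
signal = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0]
features = time_domain_features(signal)
for name, value in features.items():
    print(f"{name:<13}{value:.4f}")
```

In practice such a function would be applied to successive windows of the acquired signal, so that each window yields one feature vector for the ML model.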
For example, Wu et al. 73 use twelve different time-domain features to form a single feature vector as an input to a neural network: VPP, standard deviation, variance, mean, RMS, ARV, form factor, crest factor, kurtosis, kurtosis factor, pulse factor, and margin factor. Other papers that adopt this type of time-domain feature extraction are Refs. 10,43,54,75,76,79. Other time-domain feature extraction methods are Hierarchical Symbolic Analysis (HSA), 67 and a unique deep multilayer LSTM model that can fully extract the features from the raw monitoring data. 74

Frequency-domain is about extracting statistical features by applying the Fast Fourier Transform (FFT) to raw data; typical statistical frequency-domain features are Mean Frequency (MF), Root Mean Square Fluctuations (RMSF), Frequency Modulation (FM), Root Variance Frequency (RVF), Power Spectrum Deformation (PSD), etc.; for instance, Xie et al. 31 extract frequency-domain features and use them as inputs to a DBN model.

Time-frequency domain considers both time and frequency domains to capture how the frequency components of the signal vary as a function of time. It is commonly used to monitor the state of rotating machinery, and it is very effective for non-stationary time-series analysis. For example, the vibration signal of a bearing is non-stationary and has a weak defect signal within a strong background of noise. 80 Four further cases concern bearings' prognostics, 52,64 bearings and gears' diagnostics, 66 and gears' diagnostics, 70 in which both frequency and time domains are investigated [...]

FI is based on simply finding the best feature sub-set according to the specified objective of diagnostics or prognostics through several statistical methods, such as correlation, time-series, chi-square test, and others; unlike the following two methods, FI does not use ML algorithms to perform the PHM task, and therefore it yields a more versatile sub-set of features, which can then be employed by numerous ML algorithms. For example, Saravanakumar et al. 66 [...]

WR is based on a specific ML algorithm that has to fit a given dataset. The evaluation criterion is simply linked to the classic ML performance metrics, including those described in sub-section 3.3. Wrappers are usually able to achieve better performance than FI-based techniques since they are optimized for a specific ML algorithm, which is in turn tailored for a specific task. On the other hand, wrappers are biased toward the ML algorithm they are based on, and therefore the resulting feature sub-set is not very versatile, that is, it will not generally be adequate for alternative ML techniques. 7 For example, to automatically select and classify the most informative features, Marei et al. 79 [...]

[...] the performance of ML model for PHM through the already described KPIs in Table 3 and Table 4. It is not necessary to show the EM used in the 19 own-dataset papers, because the choice of KPIs for the evaluation of ML algorithms does not depend on the type of dataset used by the authors (own or freely available online datasets), but on the ML task (classification or regression). Therefore, Figure 7 already contains the necessary information to understand which EMs are used the most.
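The statistics-based selection described above for FI, which ranks features without training any ML model, can be sketched with a simple correlation criterion. The example below ranks candidate features by the absolute Pearson correlation with the RUL and keeps the strongest ones. Reading FI as a filter-style method is an interpretation of the text; the function names, the feature names, and all numerical values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_features(features, target, top_k):
    """Keep the top_k features with the strongest absolute correlation to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical feature values over five inspections, with the matching RUL.
features = {
    "rms": [0.10, 0.12, 0.15, 0.21, 0.30],
    "kurtosis": [3.0, 3.1, 2.9, 3.0, 3.1],
    "temperature": [40.0, 41.0, 43.0, 46.0, 50.0],
}
rul = [400, 300, 200, 100, 0]

selected = rank_features(features, rul, top_k=2)
print(selected)
```

Because no learning algorithm is involved in the ranking, the selected sub-set can be passed to any downstream model, which is the versatility attributed to FI above; a wrapper would instead re-train the chosen model on each candidate sub-set and compare the resulting KPIs.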
Table 6 summarizes the results described in this section about the 19 own-dataset papers with regard to the framework shown in Figure 10; it is sorted by industrial application type and by the paper's objective.

Conclusions
An SLR on the PHM of industrial mechanical systems and equipment was carried out. The focus was on the most used ML algorithms in the diagnostics and prognostics field, and on the related KPIs employed for validating them. A literature search on the Scopus database led to 50 studies eligible for the above-mentioned analyses, 31 of which use common public datasets, while the remaining 19 use own datasets, i.e., datasets created specifically for the task addressed and the industrial application considered by the authors. Concerning the family of ML algorithms, DL ones turned out to be the most used. Moreover, among the DL techniques, CNN and RNN are the most applied, while RF is predominant among SL techniques. Regarding the KPIs, Accuracy is by far the most used for ML classification tasks, while for ML regression tasks the frequency of the KPIs is more balanced among RMSE, MAE, A_i, and MAPE. Then, a further detailed analysis was carried out with the aim of finding a common PHM framework which describes the step-by-step diagnostics and prognostics process carried out by the authors of the 19 own-dataset papers. This analysis aims to provide the reader with a common practice for the best choice of ML algorithms and the related evaluation metrics for the manufacturing industry.
Overall, by the analyses carried out in this paper, it resulted that research is moving towards the use of more recent DL techniques, rather than the classic SL algorithms, although DL methods are more complex to build and require the so-called "big Data," not always available.On the other hand, the automated end-to-end feature extraction, together with an improved capacity of generalization has led to a large-scale replacement of the traditional SL architectures for DL ones.
The main limitation of this SLR is about the industrial mechanical systems and equipment's field of application; in fact, other industrial fields, such as aeronautical, chemical, robotics, and railway fields have been excluded.Therefore, future studies may fill this gap.

Figure 1. Overview of the literature identification process.

Figure 2. Publication and citations trend per year.

Figure 4. Datasets used in the analyzed papers.

Figure 5. Frequency of the shallow learning, deep learning, and hybrid algorithms used in the 50 analyzed papers.

Figure 6. Types of shallow learning (a), deep learning (b), and hybrid (c) algorithms used in the 50 analyzed papers.

TP = positive class samples correctly predicted; TN = negative class samples correctly predicted; FP = positive class samples incorrectly predicted; FN = negative class samples incorrectly predicted; TC = true class; FC = false class.
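As a minimal sketch of how classification KPIs are derived from the TP, TN, FP, and FN counts defined above: this section names only Accuracy explicitly, so precision, recall, and F1 are included here as an assumption about what typically accompanies it. The function name and the counts are hypothetical.

```python
def classification_kpis(tp, tn, fp, fn):
    """Standard classification KPIs computed from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical fault-diagnosis result on 100 test samples.
kpis = classification_kpis(tp=40, tn=45, fp=5, fn=10)
print(kpis)
```

On imbalanced fault datasets, Accuracy alone can look high even when most faulty samples are missed, which is why the recall term is worth reporting alongside it.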

Figure 7. Frequency of the key performance indicators for classification (a) and regression (b) tasks.

Figure 8. Overall diagnostics and prognostics distribution related to regression and classification Machine Learning (ML) tasks.

Figure 9. Own-dataset papers' diagnostics and prognostics distribution related to regression and classification ML tasks.
where only current signals are used as raw data for the diagnostics of bearings and gears.
Feature extraction. The raw data are converted into statistical features usable by the specific ML algorithm. In particular, this conversion may take place in three different domains: time-domain (TD), frequency-domain (FRD), and time-frequency domain (TFD).
Wavelet Transform (WT), Continuous Wavelet Transform (CWT), and Empirical Mode Decomposition (EMD) have been used to extract features from raw signals. In Ref. 55, for instance, the wavelet sequences are obtained using the CWT, given its capability to handle non-stationary signals with a multiscale representation, which can provide the hierarchy of structural information needed to show the dynamic characteristics of the vibration signals. Another example of a TFD method is given in Ref. 32, where 8 different TFD methods are used to extract features for bearing Fault Diagnosis.
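The TD and FRD parts of the feature extraction step can be sketched as follows. This is only an illustration under stated assumptions: the particular features (RMS, peak-to-peak, kurtosis, crest factor, dominant frequency), the naive DFT, and the synthetic 50 Hz signal are choices made for the example, not taken from the reviewed papers. TFD methods such as the CWT are not covered.

```python
import cmath
import math

def time_domain_features(signal):
    """A few common time-domain (TD) statistics of a raw signal window."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    var = sum((x - mean) ** 2 for x in signal) / n
    # Non-excess kurtosis (fourth central moment over squared variance).
    kurt = sum((x - mean) ** 4 for x in signal) / (n * var ** 2) if var > 0 else 0.0
    peak = max(abs(x) for x in signal)
    return {
        "rms": rms,
        "peak_to_peak": max(signal) - min(signal),
        "kurtosis": kurt,
        "crest_factor": peak / rms if rms > 0 else 0.0,
    }

def dominant_frequency(signal, fs):
    """Frequency-domain (FRD) feature: frequency of the largest DFT bin, in Hz.

    Uses a naive O(n^2) DFT to stay dependency-free; an FFT would be used in practice.
    """
    n = len(signal)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs / n

# Synthetic "vibration" window: 50 Hz sine, amplitude 2, sampled at 1 kHz.
fs = 1000
window = [2.0 * math.sin(2 * math.pi * 50 * i / fs) for i in range(200)]
td = time_domain_features(window)
f_dom = dominant_frequency(window, fs)
print(td, f_dom)
```

Each window of raw sensor data is thus reduced to a short feature vector, which is what the subsequent selection and modeling steps operate on.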

Figure 10. Diagnostics and prognostics process followed by the 19 own-dataset papers.
use Spearman correlation to determine how the extracted features are correlated with the actual RUL of bearings. Other examples of filter-based techniques are in Ref. 30 and Ref. 75.
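A filter of this kind can be sketched in a few lines. The example below is illustrative only: the feature names, the RUL values, and the 0.8 selection threshold are assumptions, and the implementation is a plain Spearman coefficient (Pearson correlation of average ranks), not the cited authors' code.

```python
import math

def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical run-to-failure record: RUL decreases while features evolve.
rul = [100, 80, 60, 40, 20, 0]
features = {
    "rms": [0.10, 0.15, 0.30, 0.50, 0.90, 1.60],    # grows monotonically with wear
    "noise": [0.50, 0.20, 0.90, 0.10, 0.70, 0.40],  # unrelated to degradation
}
scores = {name: spearman(vals, rul) for name, vals in features.items()}
# Keep features whose monotonic relation with RUL is strong (threshold is an assumption).
selected = [name for name, rho in scores.items() if abs(rho) >= 0.8]
print(scores, selected)
```

Unlike a wrapper, this score does not depend on any downstream model, so the retained features can be reused with different ML algorithms.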

Figure 11. Nature of the Machine Learning algorithms used in the 19 own-dataset papers.

Table 1. Number of papers related to the most relevant journals.
Figure 3. Structures, systems, or components used in the analyzed papers.

Table 2. Machine learning algorithms used in the 50 sample papers.

Table 3. KPIs for ML classification tasks.

Table 4. KPIs for ML regression tasks.
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\widehat{RUL}_i(t) - RUL_i(t)\right)^2 \qquad (8)$$
It is the average of the squares of the errors, that is, the average squared difference between the predicted and the actual RUL values at the i-th time instant.
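The regression KPIs can be computed directly from paired predicted and actual RUL values. The sketch below covers MSE, RMSE, MAE, and MAPE using their standard definitions; the function name and the sample values are hypothetical, and skipping zero actual values in MAPE is a design choice made for this example.

```python
import math

def regression_kpis(actual, predicted):
    """MSE, RMSE, MAE, and MAPE for paired actual/predicted RUL values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined where the actual RUL is 0 (at failure), so those
    # points are skipped here; this handling is an assumption of the sketch.
    pct = [abs(e) / abs(a) for a, e in zip(actual, errors) if a != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "mape": mape}

# Hypothetical RUL values (e.g. in operating hours).
actual_rul = [100, 80, 60, 40]
predicted_rul = [90, 85, 60, 50]
reg = regression_kpis(actual_rul, predicted_rul)
print(reg)
```

RMSE is in the same unit as the RUL itself, which makes it easier to interpret than MSE, while MAPE weights errors near the end of life more heavily because the actual RUL in the denominator is small there.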

Table 5. Characteristics of the selected studies.

employ a CNN model, then use the test accuracy to get feedback about the performance of the feature selection. EMM integrates the feature extraction process into the ML algorithm, which is able to pull out the most representative features from the extracted feature sub-set. Examples of the embedded approach can be found in Ref. 31, where an adaptive DBN optimized by Nesterov momentum (NM) is used to extract features from rotating machinery and to recognize bearing fault types and degrees simultaneously, or in Ref. 43 and Ref. 73, where

Table 6. Characteristics of the 19 own-dataset papers.
the complex process of feature selection is compressed into a single deep learning algorithm (CNN), which is able to learn how to select features directly from the original vibration signals in order to predict RUL 43 or diagnose faults. 73 Other examples of the EMM are in Refs. [10,32,33,52,54,55,64,67,70-77].
- Health Indicator creation. Sometimes, the feature sub-set is converted into a single health indicator (HI) through dimension reduction approaches before being passed as input to the ML algorithm. For instance, Deutsch et al. 10 combine the 6 extracted TD-based features (RMS, energy operator RMS, FM0, narrowband kurtosis, amplitude modulation kurtosis, and frequency modulation RMS) into a 1-D HI to predict the RUL of a gear. Other examples of HI creation are in Refs. [66,74] and Ref. [77].
- ML model application. The selected feature sub-set is divided into two sub-datasets (training and testing) used to train the SL or DL models and to predict the RUL or diagnose the faults of the SSCs. Figure 11 shows the nature of the ML algorithms used by the authors of the 19 papers, classified by algorithm family (SL and DL) and PHM task type (Diagnostics and Prognostics).
- Diagnostics. It refers directly to the fault diagnosis of the SSCs. As shown in Figure 11, a hybrid SL method (ANFIS) together with 3 different DL methods (RBM, AE, and CNN) have been used 1 time each in the 19 own-dataset papers to diagnose faults, with the predominant use of CNN (3 times in the 19 own-dataset papers).
- Prognostics. It refers directly to the RUL prediction of the SSCs. As shown in Figure 11, numerous ML methods have been used to carry out prognosis in the PHM field: three different SL methods (SVR, ANN, and RF), a hybrid DL/SL method (DBN-FNN), and four different DL methods, that is, a hybrid one (CNN-LSTM), RBM, CNN, and RNN; the latter is predominant, having been used 6 times in the 19 own-dataset papers.
- ML model evaluation. The final step is about evaluating