Rolling bearing fault diagnosis based on probabilistic mixture model and semi-supervised ladder network

Fault diagnosis of rolling bearings is of great significance for ensuring the production efficiency of rotating machinery as well as personal safety. In recent years, machine learning has shown great potential in signal feature extraction and pattern recognition, and it outperforms traditional fault diagnosis methods in dealing with big data. However, most current intelligent diagnostic methods assume the ideal condition that both the bearing data set and the label information are sufficient, which is often not the case in engineering practice. In response to this problem, this paper proposes to use a probabilistic mixture model (PMM) to approximate the data distribution of the bearing signal, and then to sample the probabilistic model with a Markov Chain Monte Carlo (MCMC) algorithm to expand the fault data set. In addition, the Semi-supervised Ladder Network (SSLN) can achieve the performance of a supervised classifier with only a few labeled samples. On the Case Western Reserve University (CWRU) Bearing Database, the recognition accuracy of the proposed PMM-SSLN model reaches 99.5%, and the experimental results show that the model is applicable to cases where both the bearing data set and the label information are insufficient.


Introduction
Rolling bearings are the most important and most vulnerable components of rotating machinery and are widely used in the machinery industry. 1 Fault diagnosis of rolling bearings therefore not only improves production efficiency and the service life of the equipment, but also allows faults to be detected and resolved in time to prevent serious accidents. With the rapid development of the industrial Internet, the scale of mechanical equipment groups is expanding and the number of monitoring points is increasing, so the monitoring data across the whole field of mechanical fault diagnosis are growing explosively. Efficiently mining the rich information contained in such big data has become both the focus and the challenge of current research. 2 Traditional signal processing methods for rolling bearings include time domain analysis, frequency domain analysis, and time-frequency analysis. Doguer and Strackeljan 3 investigated an approach in which suitable time domain features are selected from the higher derivatives of the acceleration signal and from parameters characterizing the randomness of the peak positions. Frequency domain analysis converts the signal from the time domain to the frequency domain, which gives more accurate frequency components of the signal. Zhang et al. 4 present an improved local cepstrum method to analyze vibration signals of rolling element bearings; the method has a strong ability to resist noise and to detect early faults. However, frequency domain analysis shows great limitations when faced with non-stationary signals. In contrast, time-frequency analysis can describe the time-varying frequency characteristics of the signal. The wavelet transform (WT) can automatically adapt to the requirements of time-frequency signal analysis, and time-frequency analysis based on WT has proved effective for non-stationary signals. 5
Traditional time-frequency analysis techniques rely on complex signal processing techniques and fault diagnosis experience. In addition, the growing amount of data brings further difficulties and challenges to fault diagnosis. In recent years, machine learning has become a powerful tool for big data analysis and is widely used in machinery fault detection and diagnosis. 6 Jian et al. 7 proposed a one-dimensional fusion neural network (OFNN) that identifies bearing health under different load conditions. Liu et al. 8 presented a novel bearing fault diagnosis method using an RNN in the form of an autoencoder, which has high robustness and classification accuracy. Haidong et al. 9 combined a deep wavelet auto-encoder (DWAE) with an extreme learning machine (ELM), and the diagnostic accuracy reached 95% in experiments on the CWRU motor bearing database.
However, these studies are based on the assumption that sufficient valid data and label information are available, which is inconsistent with the characteristics of monitoring data in engineering practice. Two weaknesses follow from this assumption. (1) In engineering practice, rolling bearings operate normally most of the time, so scarcity of fault data is an inherent feature of the field of rolling bearing fault diagnosis, while machine learning requires sufficient samples for each label. 10 (2) The equipment accumulates massive data during long-term operation, but the health status is known for only a small fraction of it. Supervised learning requires a large amount of labeled data and is likely to waste the useful information in the unlabeled data, yet manually labeling such massive unlabeled data is time-consuming.
To overcome these weaknesses, this paper proposes the PMM-SSLN framework based on the Probabilistic Mixture Model and the Semi-supervised Ladder Network. Taking into account the data characteristics of rolling bearings in engineering practice, the data set is expanded by sampling the probability model established from the bearing data, and SSLN is then used for training.
PMM is one of the most commonly used tools for clustering and density estimation, [11][12][13] and it can approximate any unknown distribution with multiple components. No matter how complicated the data distribution is, its local characteristics can be described by adding components, where each component is an independent probability distribution. Compared with a single model, PMM is more flexible and adaptable. SSLN is a purely semi-supervised learning method proposed by Rasmus et al. 14,15 Based on the traditional unsupervised network structure, lateral connections are added between the encoder and decoder at each layer to achieve compatibility with supervised learning. 16 It can greatly reduce the cost of labeling bearing samples and use unlabeled data to improve the performance of the classifier.
The contributions of this paper lie in the following two aspects. (1) PMM is applied to approximate the data distribution of the complex bearing signal, and an MCMC algorithm is then adopted to sample the model. In this way, large amounts of synthetic data with the same distribution can be generated to assist in training the neural network, providing a new idea for handling the small-sample problem in the intelligent diagnosis of rolling bearings. (2) SSLN, a purely semi-supervised learning method, is used to train the expanded data set. This method fully exploits the information hidden in the unlabeled data and achieves the performance of a supervised classifier with only a small amount of labeled data, which improves the utilization of bearing signals and thereby the accuracy of diagnosis.
The remainder of this paper is organized as follows: the mathematical model of the monitored data is introduced in section ''Probabilistic model of observed data''; estimation and data reproduction from the probabilistic model are introduced in section ''Estimation and data reproduction from PMM''; section ''Fault diagnosis using semi-supervised ladder network'' introduces the fault diagnosis model based on SSLN; the experiments and results are presented in the subsequent sections, followed by the conclusion.

Probabilistic model of observed data
The data collected from rolling bearings, including noise, can be regarded as a stochastic signal that cannot be described by a fixed mathematical relationship. Each observation represents only one of the possible outcomes within the variation range, and the variation is subject to statistical law. From a statistical point of view, the data can be considered a random variable obeying a certain probability distribution. Figure 1 shows the scatter diagram of C data points collected from a rolling bearing in C sampling periods. Owing to various external interferences, complex vibrations occur during bearing operation, so the monitored data usually do not follow a single probability distribution. PMM is more reasonable than a single model because it can use multiple components to describe a complex distribution; this feature makes PMM suitable for building probabilistic models of rolling bearing compound failures. Regardless of the complexity of the data distribution, the mixture model can add components to describe the local features of the target distribution, where each component is a probability density function. Thus, PMM can be employed to approximate the distribution of bearing signals to the greatest extent.
PMM is a linear weighted combination of finitely many single density functions. It can be generalized to the following form. Suppose the bearing data set X = {X_1, X_2, ..., X_C} obeys a certain distribution, where X is a d-dimensional random variable and d depends on the number of physical quantities monitored in the experiment, such as vibration or temperature. The probability density function is then

P(X) = Σ_{k=1}^{K} ω_k f_k(X; β_k),  with  Σ_{k=1}^{K} ω_k = 1,   (1)

where P(X) represents a mixture model with K components, f_k(X; β_k) represents the probability density function of the k-th component, β_k is its parameter set, and ω_k is its weight, that is, the prior probability of the component f_k(X). Figure 2 is a schematic diagram of a one-dimensional PMM of fault data consisting of two components. In practical applications, d, K, and the types of probability distribution can be adjusted to suit different occasions, which gives PMM its universality. The parameters to be estimated in the model are written as θ = {ω_1, ..., ω_K, β_1, ..., β_K}.
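As a concrete illustration of the mixture density described above, the sketch below evaluates a one-dimensional two-component Gaussian mixture; all parameter values are invented for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, stds):
    """Evaluate a 1-D Gaussian mixture density: sum_k w_k * N(x; mu_k, sigma_k)."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "mixture weights must sum to 1"
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Hypothetical two-component model, analogous to the schematic of Figure 2
p = mixture_pdf(0.0, weights=[0.6, 0.4], means=[-1.0, 2.0], stds=[0.5, 1.0])
```

Because the weights sum to one, the mixture is itself a valid density, no matter how many components are added to capture local features.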

Research routine
The rolling bearing fault diagnosis method based on the PMM-SSLN model consists of three steps. (1) Data acquisition and augmentation of the bearing signal model: first, collect bearing signals from the experimental device and use PMM to fit the distribution of the bearing data, establishing probability models for the bearing data under different working conditions; then sample the probabilistic models with the MCMC algorithm to ensure that each category of bearing data is sufficient. (2) Model training with the Semi-supervised Ladder Network: after obtaining sufficient training data, train SSLN with a small amount of labeled data and massive unlabeled data, completing the training once the set number of iterations is reached.

(3) Fault identification: input the data set to be tested and output the diagnosis result.
The specific process is shown in Figure 3.

Information criteria for determining the component number
The first parameter that needs to be estimated is the number of components K. As K approaches the number of samples C, the model describes the empirical distribution of the bearing data more accurately, but the model complexity also grows. This leads to over-fitting, which does not guarantee an accurate fit to the true distribution. Therefore, a criterion is needed that weighs the complexity of the model against the goodness of fit. The most commonly used criteria are the Akaike Information Criterion (AIC) 17 and the Bayesian Information Criterion (BIC), 18 defined as

AIC = 2V − 2 ln L,   (2)
BIC = V ln C − 2 ln L,   (3)

where ln L is the maximum log-likelihood of the model, V is the number of independently adjusted parameters within the model, and C is the number of samples. It can be seen from equations (2) and (3) that the preferred model is the one with the smallest AIC or BIC value. The two information criteria differ in their penalty terms: the penalty term of BIC takes the number of samples into account to prevent over-fitting when the sample size is large. For large sample sizes (C > 300), BIC performs better than AIC. 19
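For instance, BIC-based selection of K can be sketched with scikit-learn's GaussianMixture, whose bic method implements the criterion above; the two-mode synthetic data here is only a stand-in for a bearing signal.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for one channel of bearing data: two Gaussian modes
data = np.concatenate([rng.normal(-1.0, 0.3, 600),
                       rng.normal(1.5, 0.5, 400)]).reshape(-1, 1)

# Score candidate component counts K and prefer the smallest BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

With C = 1000 samples (well above 300), BIC's stronger penalty keeps the selected K small even though larger models always fit the training data at least as well.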

Expectation-maximization algorithm for parameter estimation
After the number of components K has been determined, parameter estimation is the process of solving for the parameter set θ. Maximum Likelihood Estimation (MLE) is the most commonly used means of parameter estimation in probability and statistics. For a single distribution, the estimate can be obtained directly by differentiating the likelihood function. For PMM, however, the maximum likelihood cannot be found directly because it is unknown which component each sample comes from.
The Expectation-Maximization (EM) algorithm 20 is a method for maximizing the likelihood function of a model with hidden variables. For the bearing data set X = {X_1, X_2, ..., X_C}, the hidden variable Z = {Z_1, Z_2, ..., Z_C} is introduced to indicate which component each bearing data point comes from. The log-likelihood function of the model can be written as

ln L(θ) = Σ_{c=1}^{C} ln P(X_c; θ) = Σ_{c=1}^{C} ln Σ_{k=1}^{K} ω_k f_k(X_c; β_k).   (4)

The purpose of the EM algorithm is to find the maximum likelihood solution of equation (4). Introduce a distribution Q(Z) over the hidden variable, which can be regarded as the posterior of Z. The E-step constructs a lower bound L^t of the log-likelihood function with θ fixed; according to Jensen's inequality,

ln P(X; θ) ≥ Σ_Z Q(Z) ln [P(X, Z | θ) / Q(Z)],   (5)

and the bound is tight when Q(Z) = P(Z | X, θ), giving, up to a constant independent of θ,

L^t = E_Q [ln P(X, Z | θ)],   (6)

where t represents the number of iterations and E_Q denotes the mathematical expectation of the joint log-likelihood ln P(X, Z | θ) under the distribution Q(Z). The M-step then solves for θ with Q fixed so as to maximize the lower bound; θ^t is computed as

θ^t = arg max_θ L^t(θ, θ^{t−1}),   (7)

which maximizes L^t.
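The E- and M-steps described above can be sketched for a one-dimensional GMM as follows; this is a simplified illustration with quantile-based initialization, not the paper's implementation.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=200):
    """EM for a one-dimensional Gaussian mixture; returns weights, means, variances."""
    w = np.full(k, 1.0 / k)                         # equal prior weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over quantiles
    var = np.full(k, x.var())                       # shared broad initial variance
    for _ in range(n_iter):
        # E-step: responsibilities Q(Z) = P(Z | X, theta)
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta to maximize the expected joint log-likelihood
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Synthetic two-mode data standing in for bearing samples
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.8, 500)])
w, mu, var = em_gmm_1d(x, k=2)
```

On well-separated data such as this, the recovered means and weights land close to the generating values, consistent with the monotone convergence property discussed next.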
It can be proved that the likelihood function is monotonically increasing over the iterations, that is, the EM algorithm finally converges to a maximum. This maximum yields the parameter estimate of PMM. In this way, the probability density function of the bearing data is obtained, and more data can be generated for augmentation using an appropriate sampling method. The pseudo code of the EM algorithm is given in Algorithm 1.

MCMC algorithm for reproduction of bearing data

The bearing data generation process is a mechanical discrete event system and can be considered the process of sampling the established distribution model. Expanding the training data set by sampling has the advantages of easy implementation and low cost. This paper attempts to use sampling of the PMM to improve the learning performance of deep learning networks. To sample from a specified probability distribution P(x), the difficulty is that the probability density function of PMM is not necessarily integrable in closed form and the cumulative distribution function does not necessarily have an inverse. In view of this problem, a Markov process can be constructed that, after multiple transitions, eventually converges to a stationary distribution. The nature of the Markov chain means that the stationary distribution is determined by the choice of the transition distribution k; the goal is to find a transition distribution k such that the chain converges to the target distribution P(x). The Metropolis-Hastings (M-H) algorithm is an important MCMC sampling algorithm, which avoids a large number of samples being rejected by amplifying the acceptance rate and can quickly converge to the stationary distribution. 21
The M-H algorithm introduces an acceptance probability for the proposed transition x → x*, calculated as

a(x → x*) = min{ 1, [P(x*) q(x | x*)] / [P(x) q(x* | x)] },

where q(x* | x) is the proposal probability of the transition from state x to state x*, and a(x → x*) is the probability of accepting the new state x*.
It can be proved that the M-H algorithm satisfies the detailed balance condition

P(x) q(x* | x) a(x → x*) = P(x*) q(x | x*) a(x* → x).

When the detailed balance condition is satisfied, the stationary distribution to which the chain converges after multiple transitions is guaranteed to be the target distribution P(x). As a result, the transition sequence collected after convergence can be used as simulated bearing samples. In this paper, the MCMC algorithm is used to expand the training set of the neural network in place of a large number of additional experiments.
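A minimal sketch of M-H sampling from a mixture density might look as follows; it assumes a symmetric Gaussian random-walk proposal (the paper does not specify its proposal), under which the q terms cancel in the acceptance ratio.

```python
import numpy as np

def metropolis_hastings(target_pdf, n_samples, x0=0.0, step=1.0, burn_in=1000, seed=0):
    """Random-walk M-H: the proposal is symmetric, so q cancels in the ratio."""
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for i in range(burn_in + n_samples):
        x_star = x + rng.normal(0.0, step)           # propose a new state x*
        # Acceptance probability a(x -> x*) = min(1, P(x*) / P(x))
        if rng.random() < min(1.0, target_pdf(x_star) / target_pdf(x)):
            x = x_star                               # accept the transition
        if i >= burn_in:
            out.append(x)                            # keep post-burn-in states only
    return np.array(out)

# Target density: a hypothetical two-component Gaussian mixture
def gmm_pdf(x):
    return (0.6 * np.exp(-0.5 * (x + 1.0) ** 2 / 0.25) / np.sqrt(2 * np.pi * 0.25)
            + 0.4 * np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi))

samples = metropolis_hastings(gmm_pdf, n_samples=20000, step=1.5)
```

Discarding the burn-in transitions mirrors the paper's requirement that only the sequence collected after convergence be used as simulated bearing samples.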

Algorithm 1 Expectation-Maximization algorithm
Input: X, P(X, Z | θ), P(Z | X, θ), T /*T is the maximum number of iterations*/
Output: θ
1: Initialize θ^0
2: for t = 1 to T do
3:   /*E-step: calculate the expectation*/
4:   L(θ, θ^{t−1}) = E_Q [ln P(X, Z | θ)] with Q(Z) = P(Z | X, θ^{t−1})
5:   /*M-step: maximize the expectation*/
6:   θ^t = arg max_θ L(θ, θ^{t−1})
7: end for

Algorithm 2 Metropolis-Hastings algorithm

Fault diagnosis using semi-supervised ladder network

The proposed data augmentation method solves the problem of insufficient data sets, so that SSLN has sufficient training data. In addition, the health information of bearing data in engineering practice is also insufficient: in the face of such huge amounts of data, labeling the health status of the bearings takes time and effort. SSLN is a derivative of the autoencoder; it adopts the ladder network structure to make unsupervised learning compatible with supervised learning. This semi-supervised method can make full use of a large amount of unlabeled data, which suits the high labeling cost in the field of fault diagnosis. Therefore, this paper uses SSLN as the fault diagnosis model for bearing signals.

Structure of ladder network
Semi-supervised learning is essentially a combination of unsupervised learning and supervised learning. Unsupervised learning requires that all input information be reconstructed as faithfully as possible. As shown in Figure 4, the autoencoder network is a typical unsupervised learning method. 22 A compressed representation of the input signal x is obtained through multiple layers of encoding, and the original signal is then reconstructed by decoding. This requires the highest layer to retain all detailed features as far as possible. The goal of supervised learning, by contrast, is to extract the information relevant to the classification task and filter out irrelevant information, which creates an inevitable conflict between the two. The ladder network adds lateral connections between the encoding and decoding channels. 15 This enables the high-level layers to extract the features most relevant to classification while the missing features are restored through the lateral connections during decoding, a structure that is well suited to semi-supervised learning.

Structure of SSLN
SSLN is a semi-supervised network that combines the ladder network with supervised learning. Suppose there are N labeled samples {x(n), t(n) | 1 ≤ n ≤ N} and M unlabeled samples {x(m) | N + 1 ≤ m ≤ N + M}, where M >> N and the output t(n) is a class label. Semi-supervised learning studies how these unlabeled data can aid in training a classifier. Figure 5 shows the structure of an SSLN with L layers. SSLN consists of a corrupted encoding path, a decoding path, and a clean encoding path.
In the corrupted encoding path, Gaussian random noise is added at each layer to enhance the robustness of the network. The output ỹ represents the predicted classification result of the noisy encoding process. Each layer of this path is expressed as

z̃(l) = N_B(W(l) h̃(l−1)) + n(l),
h̃(l) = φ(γ(l) ⊙ (z̃(l) + β(l))),

where W(l) is the weight matrix from layer l−1 to layer l, h̃(l−1) is the output of the activation function of layer l−1, n(l) is Gaussian random noise, N_B denotes batch normalization, 24 z̃(l) is the corrupted input after batch normalization, and γ(l) and β(l) are trainable correction factors of batch normalization. φ(·) is the activation function: the ReLU function is applied from layer 1 to layer L−1, while for layer L, the predictive output layer, the softmax classification function is selected. The supervised cost C_c of the corrupted forward pass is

C_c = −(1/N) Σ_{n=1}^{N} log P(ỹ = t(n) | x(n)).

The decoding path uses the result of the encoding process to decode the unlabeled data. Owing to the lateral connections, each hidden layer of the decoding path obtains not only the information of the layer above it but also the information of the corrupted encoding path at the same layer, and thus restores the lost information. The unsupervised cost C_d is the reconstruction error between the clean activations z(l) and the batch-normalized decoded activations ẑ(l)_BN, averaged over the training samples,

C_d = Σ_{l} λ_l ||z(l) − ẑ(l)_BN||²,

where λ_l is the weight of the layer-l cost function. The purpose of batch-normalizing ẑ(l) is to reduce the noise caused by batch normalization. The model parameters can be trained to minimize the total cost C_t = C_c + C_d. After training is completed, the test set is fed through the clean encoding path to obtain the classification result y.
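To make the corrupted encoder concrete, the following sketch implements one noisy layer and a per-layer reconstruction cost in NumPy; the shapes, noise level, and initialization are toy choices for illustration and do not reproduce the paper's network.

```python
import numpy as np

def batch_norm(z, eps=1e-6):
    """N_B: normalize each unit over the batch to zero mean and unit variance."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def corrupted_encoder_layer(h_prev, W, gamma, beta, noise_std=0.3, rng=None):
    """One corrupted-path layer: z~ = N_B(W h) + n, h = ReLU(gamma * (z~ + beta))."""
    rng = rng or np.random.default_rng(0)
    z_tilde = batch_norm(h_prev @ W) + rng.normal(0.0, noise_std,
                                                  (h_prev.shape[0], W.shape[1]))
    h = np.maximum(0.0, gamma * (z_tilde + beta))   # ReLU activation
    return z_tilde, h

def layer_recon_cost(z_clean, z_hat_bn, lam):
    """One term of the unsupervised cost: lambda_l * ||z - z_hat_BN||^2 per sample."""
    return lam * np.mean(np.sum((z_clean - z_hat_bn) ** 2, axis=1))

# Toy shapes only (the paper's encoder runs 1024 -> ... -> 4)
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # a batch of 32 samples
W = rng.normal(size=(8, 4)) * 0.1
z_tilde, h = corrupted_encoder_layer(x, W, gamma=np.ones(4), beta=np.zeros(4), rng=rng)
```

In a full ladder network, stacking such layers, adding the clean path, and summing the supervised and per-layer reconstruction costs gives the total cost C_t minimized during training.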
On the basis of the autoencoder, SSLN retains and improves the unsupervised cost, so that structural information can be extracted from a large amount of unlabeled bearing data, improving the reconstruction in the decoding process. In addition, the supervised cost turns the network into a semi-supervised learning network that extracts the features of the labeled samples. Moreover, the lateral connections improve the classification performance of the upper layers without reducing the reconstruction performance of the decoder. These advantages make SSLN well suited to bearing fault diagnosis.

Rolling bearing experimental data description
The experiment in this section uses the most commonly used vibration signal analysis method. The data comes from Case Western Reserve University Bearing Data Center. The experimental setup is shown in Figure 6.
The test stand consists of a 2 hp motor, a torque transducer, and a dynamometer. The test bearing is a deep groove ball bearing, and electro-discharge machining (EDM) is used to seed faults on its inner ring, ball, and outer ring.
Single-point faults are introduced into different components of the test bearing with fault diameters of 0.007, 0.014, 0.021, and 0.028 inches. An accelerometer collects vibration signals of the rolling bearing at the drive end with a sampling frequency of 12 kHz. Each bearing is tested under different loads (0, 1, 2, and 3 hp), with the motor speed varying between 1730 and 1797 rpm. To highlight the identification ability of the method, this section selects minor faults, that is, bearing data with a fault diameter of 0.007 inch. The experimental data set is shown in Table 1. It is divided into 4 categories and 16 groups of data. The data are processed so that the length of a single sample is 1024 points, giving 108 samples per group; each group is randomly divided into 81 training samples and 27 testing samples.
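The segmentation into fixed-length samples and the per-group train/test split described above can be sketched as follows; the synthetic record stands in for a CWRU file, whose actual loading is omitted.

```python
import numpy as np

def segment_signal(signal, length=1024):
    """Split one long 1-D record into non-overlapping fixed-length samples."""
    n = len(signal) // length
    return signal[:n * length].reshape(n, length)

rng = np.random.default_rng(0)
record = rng.normal(size=12000 * 10)     # stand-in for ~10 s of 12 kHz vibration data
samples = segment_signal(record)          # each row is one 1024-point sample

# Random split in the paper's 81:27 (i.e. 3:1) proportion per group
idx = rng.permutation(len(samples))
n_train = int(round(len(samples) * 81 / 108))
train, test = samples[idx[:n_train]], samples[idx[n_train:]]
```

Non-overlapping segmentation keeps the training and test samples statistically independent within a record, which matters for honest accuracy estimates.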

SSLN experiment setup and results
The experiment in this section builds SSLN on a fully connected network. The dimension of each layer of the encoding path is set to [1024, 1000, 500, 250, 250, 100, 4], and the weights of the unsupervised loss terms of the layers are set to [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1]. The learning rate is set to 0.02, the batch size to 32, and the maximum number of iterations to 8000.
In the training set, 32 samples are randomly selected and labeled, and the remaining 1264 samples are regarded as unlabeled. Each experiment uses the same 32 labeled samples and the same test set, while 0, 316, 632, 948, or 1264 unlabeled samples are added in turn to train the model. The training results are shown in Figure 7. When there are no unlabeled data, the encoding path of SSLN performs supervised learning with only the 32 labeled samples. The accuracy is only 25.9%, equivalent to random guessing among the four classes. The lack of labeled samples thus has a great impact on the classification performance of supervised learning, because the large amount of unlabeled data cannot be used, leaving too little training data and causing the neural network to over-fit.
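The labeled/unlabeled split used in these runs can be sketched as follows; this is index bookkeeping only, with the sample counts taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 16 * 81                        # 1296 training samples in total
idx = rng.permutation(n_train)
labeled_idx = idx[:32]                   # only 32 samples keep their labels
unlabeled_idx = idx[32:]                 # the remaining 1264 are treated as unlabeled

# Nested unlabeled subsets for the five runs: 0, 316, 632, 948, 1264 samples
subsets = [unlabeled_idx[:k] for k in (0, 316, 632, 948, 1264)]
```

Keeping the labeled indices fixed across runs ensures that accuracy differences are attributable to the amount of unlabeled data alone.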
The accuracy of semi-supervised learning improves significantly once unlabeled data are added, and it rises further as the number of unlabeled samples is gradually increased. It can be inferred that increasing the number of unlabeled samples helps improve the fault recognition accuracy of SSLN. After adding all the unlabeled data, the accuracy finally reaches 71.8%, which is still not a high recognition rate. Limited by the size of the data set, the next section continues to expand the training set through the proposed data augmentation method.

PMM-SSLN model experiment setup and results
Building on the previous section, this section designs the PMM-SSLN model to improve the fault recognition accuracy of SSLN on small data sets, taking the inner ring fault under no external load as an example. As shown in Figure 8, the frequency distribution histogram of the fault data can be obtained. The distribution is dense in the middle and sparse on both sides, which resembles a Gaussian distribution, but it does not necessarily follow a single Gaussian. This experiment therefore selects a one-dimensional Gaussian Mixture Model (GMM) for probabilistic modeling of the bearing data, meaning that each component of the PMM is the probability density function of a Gaussian distribution. Because the sample size is large, the BIC criterion is used to determine the number of components. This not only ensures the accuracy of the model but also gives it a certain generalization ability, allowing it to predict data that have not been observed.
It can be seen from Figure 9 that the BIC curve has a minimum; as the number of components increases beyond it, the BIC value rises slowly and eventually stabilizes. It can be concluded that the fit of the model is best when K = 2. After the number of components is determined, the parameters of the GMM are estimated through the EM algorithm. As shown in Figure 10, the probability density function of the model is drawn on top of the frequency distribution histogram. The GMM fits the distribution of the bearing data well compared with the single Gaussian model (GSM). If the number of components is increased further, the weights of the new components are close to zero, which is consistent with the BIC result. After establishing the GMM for the observed data, a large amount of pseudo data can be generated through M-H sampling. The acceptance rate of M-H sampling is relatively high, so it converges to the stationary distribution quickly. The pseudo data can be considered to have a certain similarity and homology with the original data in terms of probability. A threshold on the number of state transitions must be set to ensure that the output data have converged to the stationary distribution; it is set to 100,000 here. The amount of generated data is set to three times the original samples, and its frequency distribution histogram is shown in Figure 11. The generated data fit the distribution of the established GMM, which means the sampled data and the original data are very close in probability distribution.
The generated data are then divided into three groups, and each group is matched to the original data one by one according to size order; the generated data thereby receive the same time labels as the original data. This scheme allows the generated data to retain the timing information of the original signal as far as possible. The comparison between the original signal and the generated signal in the time domain is shown in Figure 12.
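One plausible reading of this rank-based matching is sketched below: generated values are assigned to the time positions of equally ranked original values, so the generated series inherits the original timing structure. The exact matching rule is an assumption, as the paper does not spell it out.

```python
import numpy as np

def rank_match(original, generated):
    """Place generated values at the time positions of equally ranked
    original values, preserving the original ordering in time."""
    order = np.argsort(original)          # time indices of original, smallest first
    out = np.empty_like(original, dtype=float)
    out[order] = np.sort(generated)       # k-th smallest generated value goes to the
    return out                            # position of the k-th smallest original

rng = np.random.default_rng(0)
original = rng.normal(size=1024)          # stand-in for one 1024-point sample
generated = rng.normal(size=3 * 1024)     # e.g. M-H samples, three times as many
groups = generated.reshape(3, 1024)       # three groups, one per synthetic sample
matched = [rank_match(original, g) for g in groups]
```

Each matched series has exactly the amplitude distribution of its generated group while rising and falling in step with the original signal.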
The proposed data augmentation method is used to expand the bearing data of each state in Table 1 to 324 samples, so that 16 × 324 bearing signals of dimension 1024 are eventually obtained. The labeled samples are the same as in the previous experiment, and the rest are treated as unlabeled data. SSLN is trained with the expanded training set, with parameter settings consistent with the previous section. To further prove the effectiveness of the PMM-SSLN model, Support Vector Machines (SVM) and Semi-supervised Support Vector Machines (S3VM) are introduced as comparison models; the same expanded data set is used in experiment 3.
Combining the experiments in the previous section, six comparative experiments can be listed as shown in Table 2.
It can be seen that the recognition accuracy of the PMM-SSLN model eventually reaches 99.5%, a significant improvement over the other five methods. The SVM model, which belongs to the shallow learning family, achieves higher accuracy than the encoder of SSLN when only a small amount of labeled data is used. But when the size of the training set increases, the diagnostic accuracy of the SSLN model improves significantly over the SVM model. This proves that SSLN has an advantage in extracting information from a large number of training samples, and that data augmentation can significantly improve its recognition accuracy. Moreover, it can be concluded that SSLN requires only a little labeled data but a large amount of unlabeled data for training, which is consistent with the fact that the health status of most bearing signals is unknown in practical engineering applications. The proposed data augmentation method ensures the sufficiency of unlabeled data and further improves the diagnostic ability of SSLN. Therefore, the PMM-SSLN model proposed in this paper can solve the problem of small data volume and scarce labels in the fault diagnosis of rolling bearings.
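For context, a baseline SVM comparison along these lines can be sketched with scikit-learn; the synthetic four-class features below are purely illustrative and unrelated to the actual bearing data or the paper's SVM configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class stand-in for bearing features (class means are artificial)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 16)) for c in range(4)])
y = np.repeat(np.arange(4), 50)

# Standardize, then fit an RBF-kernel SVM as a shallow-learning baseline
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
acc = clf.score(X, y)
```

A shallow baseline like this is fully supervised: unlike SSLN, it cannot exploit unlabeled samples, which is why its relative advantage shrinks once the training set is expanded.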

Conclusion
This paper proposes an intelligent fault diagnosis method based on a probabilistic mixture model and the Semi-supervised Ladder Network. In the case of limited valid data, the probabilistic model of the corresponding bearing data can be established through PMM, and data augmentation can then be performed through MCMC sampling. Owing to the strong expressive ability of PMM and the fast training of the EM algorithm, this method can quickly capture the empirical distribution of rolling bearing data under different complex working conditions. Furthermore, PMM has good generalization ability and does not let over-fitting harm the prediction of unobserved data. On the other hand, the labeling cost of rolling bearings is high in industrial applications. SSLN, which has the capability of extracting features from unlabeled data, requires only a small amount of labeled data to train the model. Experimental results show that the proposed PMM-SSLN model can solve the problem of lacking valid data and label information in industrial applications, with a final accuracy of 99.5% on the CWRU bearing data set. Our future work includes further improving the model to better address data imbalance and to realize the identification of compound faults of rolling bearings.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.