Using a dynamically selective support vector data description model to discover novelties in the control system of an electric arc furnace

As more data-driven control strategies are applied in electric arc furnace systems, the problem of novelty detection has drawn increasing attention. The presence of outliers is a major obstacle to the practical application of these advanced control techniques. To this end, this paper proposes a dynamically selective support vector data description model to discover novelties in electric arc furnaces. In this model, support vector data description plays the role of base detector. Artificial outliers are generated with two objectives: one is to assist the dynamic selection, and the other is to optimize two parameters of support vector data description. A clustering technique is then used to determine the validation set for each test point. Finally, a probabilistic method is used to compute the competence of the base detectors. In contrast to other novelty ensembles that have parallel structures, our ensemble model has a dynamic selection mechanism that can better exploit the potential of the base detectors. Three synthetic and three real-world datasets are used to validate the effectiveness of the proposed detection model. Experimental results confirm the advantage of our method over several competitors.


Introduction
Novelty detection or outlier/anomaly detection techniques have been applied in many practical domains, such as fraud detection for credit cards, intrusion detection for cyber-security, and fault detection for industrial systems, to name but a few. [1][2][3] However, developing dedicated novelty detectors for industrial control systems has rarely been considered, let alone for an electric arc furnace (EAF) control system. In recent years, many industrial systems (including EAF) have introduced data-driven techniques to facilitate modeling and control, as more and more process data can be collected and stored. Accordingly, novelty detection is drawing increasing attention in industrial systems, because anomalous observations have an adverse impact on both the modeling and the control process of any data-driven technique. In the control system of an EAF, outliers refer to observations that cannot reflect the normal system states. 4 Recently, several advanced control strategies such as adaptive control and model predictive control have been applied to EAF systems in order to improve control performance and save energy. [5][6][7][8] In these control methods, machine learning algorithms such as neural networks are used to establish the process model of the EAF. It is well known that such data-driven models are often very sensitive to outliers in training sets, and the resultant control performance will deteriorate when outliers are included. (This is arguably the main reason why these advanced data-driven control strategies have not been extensively used in practical EAF control systems.) In this situation, implementing an efficient novelty detector in the EAF control system may be beneficial. It is noteworthy that such a novelty detector is subtly distinct from a process monitoring model.
Here, we are only interested in the controlled variables or variables that would be used by data-driven control strategies. In contrast, process monitoring concerns more about the state variables of systems.
Although few dedicated detectors for EAF have been proposed in the literature, many existing techniques in machine learning and data mining can be used here. According to the availability of supervision, novelty detectors are often categorized into three groups: supervised detectors, unsupervised detectors, and semi-supervised detectors. Supervised detectors use labeled training data to learn conventional classifiers, such as support vector machines (SVMs) and decision trees, that can separate normal observations from anomalous ones. A crucial drawback of this type of detector lies in the need for labeled training data, which is a great challenge for most practical applications, including EAF, because labeling process data is time- and labor-consuming. In contrast, unsupervised detectors use similarity criteria such as distance and density to mine potential outliers in databases. These detectors are therefore usually used in an off-line manner, and the most representative methods are the distance-based detector and the local outlier factor (LOF). 9,10 Semi-supervised detectors are also referred to as one-class (OC) classifiers or data description techniques. The basic idea of this type of detector is that a normal pattern can be learnt, since all training samples are assumed to come from the target class. In this paper, we will often use the term OC classifiers to denote semi-supervised detectors. The most representative OC classifier is support vector data description (SVDD), which aims to enclose all training data within a hypersphere whose volume should be as small as possible. 11 Given their mechanisms, semi-supervised detectors are the most appropriate for novelty detection in an online manner. In order to improve the performance of single detectors, several ensemble models have also been proposed. [12][13][14] By combining diverse base detectors, the final detection performance can be enhanced.
Note that a notable limitation of these ensemble models is that the base detectors should be accurate and diverse simultaneously. This assumption can hardly be satisfied when no data labels are available. In this paper, we propose a dynamically selective model that uses SVDD as the base detector. Our detector can also be deemed an ensemble model, as several base learners are necessary. In contrast to the above ensemble detectors, where a fixed model structure is used for all test points, our detector selects the most competent base detector for each test point dynamically. In order to facilitate the selection procedure, we use a trick to generate artificial outliers. Moreover, these outliers are also used to obtain optimal parameters for the SVDD algorithm in our detector, a problem that has often been overlooked in studies using SVDD. In addition, a clustering technique is used in the selection process to determine the validation set for each test point.
Here, we summarize the contributions as follows:
1. A dynamic novelty detection model is proposed for the EAF control system.
2. Artificial outlier examples are generated to determine validation sets and optimize the parameters of SVDD.
3. Datasets from a real-world EAF system are used to verify the effectiveness of our detection model.
The rest of paper is organized as follows. Some related works and preliminaries will be presented in section ''Related works and preliminaries.'' The proposed method will be introduced in section ''Methodology,'' followed by the experiments in section ''Experiments and analysis.'' Finally, some conclusions will be drawn in section ''Conclusion.''

Related works and preliminaries
Several related works regarding novelty detection in EAF systems will be introduced despite their sparseness. Then some necessary preliminaries will be presented briefly.

Related work
In Liu et al., 4 a model-based novelty detection model is proposed for the process control system of EAF. In this model, an improved radial basis function (RBF) neural network is first used to establish the process model of the EAF. Then a hidden Markov model (HMM) is used to analyze the residuals between the true measurements and the outputs of this process model. From our point of view, the main drawback of this method is that it can only be used for univariate data; for multivariate datasets, several such models may be necessary. It also depends heavily on the predictive model, and the detection performance deteriorates considerably when the predictions are biased. In Wang and Mao, 12 a clustering-based ensemble detector is proposed for the EAF control system. In this method, a clustering algorithm is first used to separate the training set into several subsets, in each of which a single detector is established. Then any test point is labeled as an outlier if it is rejected by all base detectors. In Wang and Mao, 13 the random subspace (RS) technique is used to develop an ensemble detector. In this method, RS first divides the feature space into several subspaces, onto which all training points are projected to generate several training subsets; corresponding base detectors are then trained on these subsets, and a combination rule derives the ultimate result for each test point. As mentioned previously, a main drawback of these ensemble detectors is that generating accurate and diverse base detectors may be difficult in some situations.
In the fields of machine learning and data mining, novelty detection has always been a hot topic, since outliers often indicate interesting data patterns. In general, existing novelty detection methods can be categorized into probabilistic, distance-based, reconstruction-based, and domain-based detection models. 15 Among probabilistic detection models, the Gaussian mixture model (GMM) is one of the most popular parametric methods, with HMM and the Kalman filter being two other commonly used parametric ones; kernel density estimation is the most popular non-parametric method. Among distance-based detection models, methods based on nearest neighbors and clustering techniques are often used in many applications. Neural network (NN)-based models and principal component analysis (PCA) are two commonly used reconstruction-based detection models. Finally, SVDD and OC-SVM are two representative domain-based detectors.

Preliminaries
Basic concepts concerning SVDD and ensemble learning will be introduced briefly.

SVDD.
Algorithm SVDD defines a model by using a hypersphere to give a closed boundary around all observations in the training set. This hypersphere is characterized by a center a and a radius R in the original feature space. 11 In order to allow for the possibility of outliers in the training set, slack variables \xi_i are introduced. In addition, a kernel mapping \phi(x_i) is used to make the description more flexible. The minimization problem hence becomes

\min_{R, a, \xi}\; R^2 + C \sum_i \xi_i \quad \text{s.t.}\quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0\ \ \forall i,

where the parameter C controls the balance between the volume of the hypersphere and the errors on the target class. A larger value of C penalizes the slack variables more heavily, so that fewer training points are left outside the hypersphere. This convex optimization problem can be solved through its dual form. By setting the partial derivatives to zero, the dual optimization problem becomes

\max_{\alpha}\; \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i = 1,

where K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j). The solution of this dual problem is a set of values \alpha_i. Objects x_i with \alpha_i > 0 are referred to as support vectors (SVs), with which the center can be computed as a = \sum_i \alpha_i \phi(x_i). Furthermore, any SV x_k on the boundary, that is, with 0 < \alpha_k < C, can be used to calculate the radius:

R^2 = K(x_k, x_k) - 2\sum_i \alpha_i K(x_i, x_k) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).

For any test point x, the distance from x to the center a is derived by

d^2(x) = K(x, x) - 2\sum_i \alpha_i K(x_i, x) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).

If d is larger than R, sample x is labeled as an outlier or a fault.
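As a concrete illustration of the data description step: with an RBF kernel, SVDD is equivalent to the one-class SVM, so the boundary above can be sketched with scikit-learn's OneClassSVM. This is a minimal sketch; the synthetic data and the nu and gamma values are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))            # target-class samples
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),   # normal points
                    rng.normal(8.0, 0.5, size=(5, 2))])  # obvious outliers

# nu upper-bounds the fraction of training points outside the description;
# gamma = 1 / (2 * sigma^2) controls the RBF kernel width.
svdd = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)
labels = svdd.predict(X_test)   # +1 -> inside the description, -1 -> outlier
print(labels)
```

Points far from the training cloud receive near-zero kernel similarity to all support vectors, so their decision value falls below the threshold and they are rejected.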
Ensemble of OC classifiers. Designing parallel ensembles of OC classifiers is easier than designing sequential ones, because techniques used in conventional classification problems can be applied directly. The rationale of a parallel ensemble lies in reducing variance by inducing diverse base detectors (OC classifiers) and aggregating them. Several strategies can be used to enhance this diversity. The most commonly used technique is Bagging, which uses bootstrap sampling; by fusing individuals learnt on different training subsets, Bagging is expected to obtain more robust results. Another well-known strategy for enhancing diversity is to use different feature subsets, with RS and feature bagging (FB) being the most commonly used techniques of this type. Note that subspace-based outlier ensembles are especially efficient on high-dimensional datasets, where outliers can easily be masked in the full space yet exposed in certain subspaces. Apart from these two strategies, using different model parameters (or initializations) and even different algorithms has also been proposed to enhance the diversity of outlier ensembles.
EAF. The EAF is a highly energy-intensive process used to convert scrap metal into molten steel. EAFs range in capacity from a few tons to as many as 400 tons. Figure 1 gives a simple description of the EAF operation. The graphite electrodes connected to the electrical supply convert electrical energy into thermal energy via the electric arcs between the electrodes and the scrap surface. In addition, natural gas and oxygen are injected into the furnace so that the released chemical energy can also be converted into thermal energy. The scrap keeps melting by absorbing this extensive thermal energy. As parts of the scrap melt and are removed, the scrap surface becomes irregular, changing the contours of the surface and disturbing the arc length. The electrode regulation system responds to these disturbances by adjusting the distance from the electrodes to the scrap surface so that the optimal arc length can be maintained. When sufficient space becomes available inside the furnace, another scrap charge is added, and the melting process proceeds until a flat bath of molten steel is formed at the end of the batch.

Methodology

Base detectors
As discussed previously, our detection model can be deemed an ensemble model, so the generation of base detectors is indispensable. In order to train diverse and accurate SVDD models in our ensemble, a subspace-based ensemble technique called FB 17 is used. Subspace-based ensemble techniques are often more efficient than subsampling techniques in unsupervised learning, even when the dimension of the given data is not very high. 18 The basic steps of FB can be described as follows:
1. Sample an integer r from d/2 to d - 1;
2. Randomly select r dimensions from the data D to create an r-dimensional projection;
3. Apply the base detector on the projected representation;
4. Repeat the above steps for M iterations.
In other words, an integer r is sampled between d/2 and d - 1, r dimensions are then sampled from the dataset, and the data are finally scored in this projection.
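The subspace-sampling loop of FB can be sketched as follows. This is a minimal sketch of the sampling step only; each projected dataset `X[:, dims]` would then be passed to one base SVDD, and the function name is our own.

```python
import numpy as np

def feature_bagging_subspaces(d, M, rng):
    """Draw M random subspaces as in Feature Bagging: each round picks an
    integer r in [d // 2, d - 1] and then r distinct feature indices."""
    subspaces = []
    for _ in range(M):
        r = rng.integers(d // 2, d)              # r in {d//2, ..., d-1}
        dims = rng.choice(d, size=r, replace=False)
        subspaces.append(np.sort(dims))
    return subspaces

rng = np.random.default_rng(42)
for dims in feature_bagging_subspaces(d=6, M=4, rng=rng):
    print(dims)  # each projection X[:, dims] trains one base detector
```

For the six EAF variables (d = 6), each subspace thus contains between three and five features.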

Artificial outliers
Before introducing the generation of artificial outliers, we first give a brief description of dynamic ensembles, which explains the purpose of the artificial outliers. The general training and test process of dynamic ensemble learning is demonstrated in Figure 2.
Once the base detectors have been trained, a selection procedure is implemented for each test point. We can see from Figure 2 that a validation set is necessary to complete the selection. The objective of this validation set is to provide reference examples in order to identify the competence of all base detectors with respect to this test point. The most competent base detector(s) will then be selected according to the result of the competence calculations.
From the above description of dynamic ensembles, we can see that the role of the validation set is critical, as it is the premise of the competence calculation. However, in novelty detection we have no labeled training examples to constitute this validation set; as a result, we have to generate artificial ones instead. A simple strategy for generating artificial outliers is to sample examples from a bounded uniform distribution. Another strategy is to assume that outlier examples locate in sparse regions of the target domain, that is, regions where the target data are either absent or isolated from the rest of the data. 19 Fan et al. 20 propose to generate outliers close to the target data by constraining the learning algorithm to form an accurate boundary between known classes and anomalies: the value of one feature of a target point is changed randomly while leaving the other features unchanged. A major drawback of these methods is the impossibility of generating a sufficient amount of outlier examples in high-dimensional situations, due to the curse of dimensionality. To this end, Tax and Duin 21 propose to generate a uniform hyper-spherical outlier distribution, which can fit tighter around the target class than a hyper-box distribution. Such an approach was mainly used to optimize the parameters of OC classifiers; in a later refinement, RSM and Bagging are used to subsample features and the training set, respectively, so that the amount of required outliers is reduced considerably compared with the original version. Furthermore, sparsity information extracted from the original training set is used to make the artificial outlier class complementary to the target class. Experimental results on several benchmark datasets have shown the superiority of this method. Inspired by this, we also use such a strategy to generate outliers in this paper, with some adjustments.
In particular, RSM and Bagging will not be used to sample features and the training set; only FB is used instead, since it is also used to train the base detectors.
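The hyper-spherical sampling idea can be sketched as follows. This is a hedged sketch of the general idea in Tax and Duin, not the exact procedure of this paper; the function name and the 1.1 margin are our own assumptions. Note the `U**(1/d)` transform on the radius, which is required for the samples to be uniform over the volume of a d-ball.

```python
import numpy as np

def uniform_hypersphere_outliers(X, n_out, margin=1.1, rng=None):
    """Sample artificial outliers uniformly from a hypersphere centred on
    the target data (sketch of a uniform hyper-spherical distribution)."""
    rng = np.random.default_rng() if rng is None else rng
    center = X.mean(axis=0)
    radius = margin * np.linalg.norm(X - center, axis=1).max()
    d = X.shape[1]
    # Uniform direction: normalised Gaussian samples; uniform radius in a
    # d-ball requires the U^(1/d) transform, not U itself.
    dirs = rng.normal(size=(n_out, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n_out, 1)) ** (1.0 / d)
    return center + radii * dirs

X = np.random.default_rng(1).normal(size=(100, 6))
Z = uniform_hypersphere_outliers(X, n_out=50, rng=np.random.default_rng(2))
print(Z.shape)  # (50, 6)
```

A hypersphere encloses the target class far more tightly than a bounding hyper-box in six dimensions, which is exactly why fewer artificial outliers are needed.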

Score normalization
Before the selection procedure, a normalization procedure for all base detectors is necessary in order to achieve an unbiased selection. It has been shown that, even when using the same base detection method with identical parameterization, outlier scores obtained from different subspaces can vary considerably if some subspaces have largely different scales. 23 Among techniques for outlier score normalization, those converting the outlier scores of different base detectors into probability estimates are the most widely acknowledged. As claimed by Gao and Tan, 24 there are many advantages to transforming outlier scores into well-calibrated probability estimates, a dominant one being that probability estimates are more appropriate for developing an ensemble outlier detection framework; a sigmoid function and mixture modeling are accordingly used to fit outlier scores into probability values in that study. In Kriegel et al., 25 a more general framework for outlier score normalization is provided. The fundamental motivation of this framework is to establish sufficient contrast between outlier scores and inlier scores, so that outliers can be easily separated from inliers. This seems more practical than merely interpreting outlier scores, because we actually need to pick out outliers in some applications. However, we may encounter a problem if we directly use these normalization methods. The methods in Gao and Tan 24 and Kriegel et al. 25 are designed for mining outliers in given databases, so the normalization or scaling procedures are implemented with samples only from the given database. When using these procedures for unseen test samples, the probability of an observation being an outlier may fall outside the range [0, 1], leading to a loss of interpretability of the probability estimates.
To address this problem, we first make some adjustments to the outputs of the base detectors before converting them into probabilistic estimates. Generally, normalizing them to Z-scores is a good choice, and its effectiveness has been verified in Aggarwal and Sathe 26 and Nguyen et al. 27 However, there is still room for improvement, as with using Z-scores for univariate outlier detection (the ''3 s edit rule''). Z-scores are sensitive to outliers, because the presence of outliers tends to inflate the variance estimate. The Z-scores of all normal data would therefore move toward those of the outliers, reducing the contrast between outliers and inliers. As we have emphasized, sufficient contrast between outliers and inliers is preferred in most scenarios of outlier or novelty detection. In light of this, we propose to adjust the calculation of the Z-score following the ''Hampel identifier,'' which replaces the outlier-sensitive mean and standard deviation estimates with the outlier-resistant median and median absolute deviation from the median (MAD), respectively. 28

Dynamic selection

Validation set. In general, the K-nearest neighbor (KNN) algorithm is used to determine the validation set of any test point in dynamic ensembles. However, finding the nearest K neighbors requires calculating the distances to all data points in the training set, and this computational cost can be too expensive for online applications. With a clustering-based method, only the test point's cluster needs to be determined, provided that all data points have been divided into several clusters at the training phase. As a result, we prefer to use a clustering algorithm to determine the validation set for each test point.
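As an aside, the Hampel-style score standardisation introduced for the base detectors can be sketched in a few lines. The constant 1.4826 (an assumption of this sketch, not stated in the paper) makes the MAD consistent with the standard deviation under a normal distribution; the score values are illustrative.

```python
import numpy as np

def robust_z_scores(scores):
    """Hampel-style standardisation: replace the outlier-sensitive mean and
    standard deviation with the median and MAD."""
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med))
    return (scores - med) / (1.4826 * mad)

raw = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 9.0])  # one inflated outlier score
print(robust_z_scores(raw).round(2))
```

Unlike plain Z-scores, the inflated score does not shrink the standardized values of the normal scores, so the contrast between outliers and inliers is preserved.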
Here, we choose three representative clustering algorithms as candidates and make some quantitative comparisons: the classical K-means clustering, the GMM, and a density-based algorithm named DBSCAN (density-based spatial clustering of applications with noise). K-means is the most popular clustering algorithm due to its simple theory and implementation, 29 but its drawbacks mainly lie in its sensitivity to initial values, noise, and outliers; correspondingly, several improved versions have been proposed. GMMs are among the most statistically mature methods for clustering. Each cluster is represented by a Gaussian distribution, and the clustering process thereby turns into estimating the parameters of the Gaussian mixture, usually by the expectation-maximization algorithm. 30 Its probabilistic form of output is an advantage, which allows GMM clustering to be combined with other statistical learning models smoothly and naturally; its drawbacks lie in its probabilistic assumption, which also places higher demands on the sample size and representativeness. The largest advantage of DBSCAN lies in its ability to discover clusters of arbitrary shape. 31 It also requires fewer input parameters and, in particular, does not need the number of clusters; however, it becomes unstable when detecting border objects of adjacent clusters.
The quantitative criterion we use is the Calinski-Harabasz (CH) index. 32 This quantity is defined as

CH_k = \frac{SS_B / (k - 1)}{SS_W / (N - k)},

where SS_B denotes the overall between-cluster variance, SS_W the overall within-cluster variance, k the number of clusters, and N the number of observations. SS_B and SS_W are defined as

SS_B = \sum_{i=1}^{k} n_i \, \| m_i - m \|^2, \qquad SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \| x - m_i \|^2,

where n_i indicates the number of observations in cluster c_i, m_i is the centroid of cluster c_i, m is the overall mean of the sample data, \|m_i - m\|^2 is the squared L_2 norm between m_i and m, and \|x - m_i\|^2 is the squared L_2 norm between x and m_i. The CH index can thus be deemed the ratio of the between-cluster variance to the within-cluster variance, and a larger CH value indicates a better data partition. The optimal number of clusters can be determined by maximizing CH_k with respect to k. When classifying a test point, we first decide which cluster it belongs to; all data points in that cluster then constitute the validation set with respect to this test point.
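Selecting the number of clusters by maximizing the CH index can be sketched with scikit-learn, which provides the index directly. The three-blob toy data and the candidate range of k are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Three well-separated 2-D blobs; CH should be maximised at k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in (0.0, 4.0, 8.0)])

best_k, best_ch = None, -np.inf
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch = calinski_harabasz_score(X, labels)   # ratio of between/within variance
    if ch > best_ch:
        best_k, best_ch = k, ch
print(best_k)
```

At test time, each new point is simply assigned to the nearest of the best_k centroids, and that cluster's members serve as its validation set.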
Competence calculation. We wish to select the most competent base detector by computing the competences of all base detectors on the validation sets. With the artificial outliers in the training set, we can employ selection mechanisms proposed for traditional classification problems. Rather than merely estimating classifier accuracy as the percentage of correctly classified samples, we use a probabilistic measure to select the most competent classifier.
Let V = \{X_t^1, \ldots, X_t^K\} denote the validation set of test pattern X_t. For each base classifier C_j, the probability of correct classification of any test pattern can be estimated (when class labels of the validation patterns are available) by

\hat{p}_{correct}(C_j) = N_j / K,

where N_j is the number of neighbor patterns that are correctly classified by classifier C_j. Assume that a neighbor pattern X_t^i \in \omega_l, l \in \{O, M\} (O and M denote the outlier and normal class, respectively); then P_j(\omega_l \mid X_t^i) provided by classifier C_j can be deemed its measure of competence for X_t^i. As a result, the competence of classifier C_j on pattern X_t can be derived by averaging its competences on all neighbor patterns:

\hat{c}(C_j \mid X_t) = \frac{1}{K} \sum_{i=1}^{K} P_j(\omega_{l_i} \mid X_t^i),

where \omega_{l_i} is the label of X_t^i. A weight can then be assigned to each neighbor pattern in order to reduce the uncertainty in the definition of the neighborhood size:

\hat{c}(C_j \mid X_t) = \frac{\sum_{i=1}^{K} W_i \, P_j(\omega_{l_i} \mid X_t^i)}{\sum_{i=1}^{K} W_i},

where W_i = 1/d_i and d_i is the distance from pattern X_t^i to X_t. As the outputs of the base classifiers have been transformed into posterior probability estimates, we can also exploit this information to measure classifier competence:

\hat{p}(X_t \in \omega_l \mid C_j(X_t) = \omega_l) = \frac{N_{ll}}{\sum_{l' \in \{M, O\}} N_{l'l}},

where N_{ll} is the number of neighbor patterns that have been correctly classified by C_j to class \omega_l, and \sum_{l' \in \{M, O\}} N_{l'l} is the total number of neighbor patterns that have been classified to class \omega_l by classifier C_j. Equivalently, the posterior probabilities of the neighbor patterns can be exploited according to Bayes' theorem:

\hat{p}(X_t \in \omega_l \mid C_j(X_t) = \omega_l) = \frac{\hat{p}(C_j(X_t) = \omega_l \mid X_t \in \omega_l) \, \hat{p}(\omega_l)}{\sum_{l' \in \{M, O\}} \hat{p}(C_j(X_t) = \omega_l \mid X_t \in \omega_{l'}) \, \hat{p}(\omega_{l'})}.

The term \hat{p}(C_j(X_t) = \omega_l \mid X_t \in \omega_l) indicates the probability that patterns belonging to class \omega_l are correctly classified; it can be estimated as

\hat{p}(C_j(X_t) = \omega_l \mid X_t \in \omega_l) = N_{ll} / N_l,

where N_l is the number of neighbor patterns belonging to class \omega_l. The term \hat{p}(\omega_l) denotes the prior probability of class \omega_l, which can be estimated as

\hat{p}(\omega_l) = N_l / K.

A weight is again assigned to each neighbor pattern to reduce the uncertainty caused by the neighborhood size, and the competence of classifier C_j on test pattern X_t is finally obtained. We summarize this procedure in Algorithm 1.
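The distance-weighted competence averaging described above can be sketched as follows. This is a simplified sketch of the selection step only, not the full Algorithm 1; the function name, the toy detectors, and their probability outputs are our own assumptions.

```python
import numpy as np

def select_most_competent(detectors_proba, y_val, X_val, x_test):
    """Pick the base detector with the highest distance-weighted competence
    on the validation set. detectors_proba: list of callables mapping an
    (n, d) array to an (n,) array of P(outlier). y_val: 1 outlier, 0 normal."""
    w = 1.0 / (np.linalg.norm(X_val - x_test, axis=1) + 1e-12)
    competences = []
    for proba in detectors_proba:
        p_out = proba(X_val)
        # probability the detector assigns to the *true* class of each point
        p_true = np.where(y_val == 1, p_out, 1.0 - p_out)
        competences.append(np.sum(w * p_true) / np.sum(w))
    return int(np.argmax(competences))

# Toy check: detector A is right and confident, detector B is right but unsure.
X_val = np.array([[0.0, 0.0], [5.0, 5.0]])
y_val = np.array([0, 1])
det_a = lambda X: np.array([0.1, 0.9])   # assumed P(outlier) per point
det_b = lambda X: np.array([0.4, 0.6])
print(select_most_competent([det_a, det_b], y_val, X_val, np.array([0.5, 0.5])))
```

Because the outputs are probabilities rather than hard labels, a confident correct detector outscores a hesitant correct one, which is the point of the probabilistic competence measure.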

Optimization of SVDD
In the SVDD algorithm, two parameters \nu and \sigma need to be determined a priori. Parameter \nu is defined as

\nu = \frac{1}{NC},

where N is the number of training data and C is the trade-off parameter. It has been proved in Tax and Duin 11 that \nu is an upper bound on the fraction of target-class objects outside the description. In addition, the fraction of objects that become SVs is a leave-one-out estimate of the error on the target set:

E[f_{T^-}] = \frac{\#SV}{N},

where \#SV indicates the number of SVs. Using this equation, one parameter (\nu or \sigma) can be optimized. Unfortunately, this does not uniquely define both parameters: minimizing just the error on the target set is not sufficient. To estimate the outlier acceptance rate without real outlier examples, we have to assume an outlier distribution. In section ''Artificial outliers,'' artificial outliers have been generated. With these outlier examples, we can optimize both parameters simultaneously by minimizing the following error term:

\varepsilon(\nu, \sigma) = \lambda f_{T^-} + (1 - \lambda) f_{O^+},

where f_{T^-} denotes the fraction of rejected target examples and f_{O^+} denotes the fraction of accepted outlier examples. Parameter \lambda thus indicates a trade-off between the target error and the outlier error; taking \lambda = 1/2 weights the two errors equally.
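Minimizing the combined error over a grid of candidate parameters can be sketched as follows. This is a hedged sketch: it uses scikit-learn's OneClassSVM (equivalent to RBF-kernel SVDD, with gamma playing the role of the width parameter), box-uniform artificial outliers instead of the paper's hyper-spherical ones, and an illustrative grid.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_tgt = rng.normal(0.0, 1.0, size=(200, 2))      # target class
Z_out = rng.uniform(-6.0, 6.0, size=(200, 2))    # artificial outliers

# Grid-search (nu, gamma) by minimising  lam * f_T-  +  (1 - lam) * f_O+,
# the weighted sum of rejected targets and accepted artificial outliers.
lam, best = 0.5, (np.inf, None)
for nu in (0.01, 0.05, 0.1):
    for gamma in (0.01, 0.1, 1.0):
        m = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_tgt)
        f_t_minus = np.mean(m.predict(X_tgt) == -1)  # targets rejected
        f_o_plus = np.mean(m.predict(Z_out) == +1)   # outliers accepted
        err = lam * f_t_minus + (1.0 - lam) * f_o_plus
        if err < best[0]:
            best = (err, (nu, gamma))
print(best[1])
```

With lam = 0.5, a description that swallows the artificial outliers is penalized exactly as much as one that rejects genuine target points.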

Experiments and analysis

Datasets
In EAF control systems, three secondary currents and three secondary voltages are often used by data-driven control strategies. For example, in Li and Mao, 6 these six variables are used to identify the process model of the EAF. In this paper, we also use these variables to constitute the training and test sets. In total, we use six datasets: three synthetic ones generated by the simulation model and three real-world ones. A simple description of these datasets is shown in Table 1.
The first three datasets are generated using the simulation model, with different faults simulated in different datasets. The last three datasets are collected from real-world EAF control systems. In each dataset, 70% of the normal data are randomly selected to constitute the training set, and all remaining samples constitute the test set. This process is repeated 10 times, and the averaged values are used as the final results. Here, we refer to our method as dynamic selection SVDD (DS-SVDD), as it is a dynamically selective SVDD model.
In this paper, we use three metrics: G-mean, F-measure, and the ROC curve (receiver operating characteristic curve). Classification performance is represented by a confusion matrix, as illustrated in Table 2.
Then we can formulate G-mean as follows:

G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}.

This metric evaluates the degree of inductive bias as the geometric mean of the positive-class accuracy and the negative-class accuracy.
F-measure can be formulated as follows:

F_\beta = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall},

where \beta is a coefficient that adjusts the relative importance of precision versus recall (usually \beta = 1). By combining recall and precision, the F-measure provides more insight into the functionality of a classifier than the accuracy metric.
The ROC curve describes the trade-off between the true-positive rate and the false-positive rate. (Note that normal data are regarded as positive in this paper, so the true-positive rate indicates the rate of correctly detected normal data.) It thus evaluates the general performance rather than the performance at a single operating point. In practice, the area under the ROC curve (AUC) is used, since directly comparing the ROC curves of different detectors is difficult. For a novelty detection task, the AUC of a perfect algorithm equals 1, implying that all outliers are identified and no normal data are misclassified. Algorithms with AUC values smaller than 0.5 are often deemed invalid, since ''random guessing'' obtains an AUC of 0.5. Here, we employ the method in Huang and Ling 34 to calculate the AUC.
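A worked example of the three metrics (with the normal class treated as positive, as in this paper) can be sketched with scikit-learn; the labels and scores below are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = normal, 0 = outlier
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3])  # P(normal)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                  # accuracy on normal data
tnr = tn / (tn + fp)                  # accuracy on outliers
g_mean = np.sqrt(tpr * tnr)           # geometric mean of the two accuracies
f1 = fbeta_score(y_true, y_pred, beta=1)
auc = roc_auc_score(y_true, scores)   # threshold-free ranking quality
print(round(g_mean, 3), round(f1, 3), round(auc, 3))
```

Note that the AUC uses the continuous scores rather than the thresholded predictions, which is why it captures performance over all operating points at once.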

Result and analysis
Results on all six datasets with respect to the three metrics are shown in Tables 3-5, respectively. Apart from the per-dataset values, we also provide the values averaged over all datasets, so that we can gain insight into the general performance. The comparison of these averaged values can be seen clearly in Figure 3, from which we find that our method (DS-SVDD) achieves the best general result on all three metrics. We then compare DS-SVDD with its competitors one by one.

DS-SVDD versus SVDD. On all six datasets, DS-SVDD outperforms SVDD in terms of the three metrics. This result implies that the improvement provided by the dynamically selective procedure in DS-SVDD is very effective. On the other hand, our selective mechanism can further improve the performance of FB.

DS-SVDD versus BA-SVDD. Before comparing these two models, we first compare the result of BA-SVDD with that of RS-SVDD. From Figure 3, we can see that RS-SVDD achieves a better general result than BA-SVDD; on four of five datasets, RS-SVDD outperforms BA-SVDD. This comparative result indicates that subspace-based novelty ensembles are often more efficient than subsampling-based ones. With this comparison, we can easily understand why DS-SVDD outperforms BA-SVDD on all datasets.

DS-SVDD versus C-SVDD. An interesting point in this comparison is that C-SVDD outperforms DS-SVDD on dataset 3. To find the reason behind this, we first check the performance of RS-SVDD: in terms of all three metrics, it obtains the worst result on this dataset. Although DS-SVDD achieves a small improvement, it is not enough to push DS-SVDD past C-SVDD. On the other datasets, however, DS-SVDD still performs better than its competitors.

Conclusion
To facilitate the development of advanced data-driven control strategies in EAF systems, this paper proposes a dedicated novelty detection model based on dynamic ensemble learning theory. In this detection model, SVDD plays the role of base detector. Artificial outliers are generated with two objectives: one is to complete the dynamic selection, and the other is to optimize two parameters of SVDD. A clustering technique is then used to determine the validation set for each test point. Finally, a probabilistic method is used to compute the competence of the base detectors. In order to validate the proposed detection model, we compare it with four competitors on three synthetic and three real-world datasets; the results show the superiority of our method. However, several issues regarding our method remain open. For example, the procedure for generating artificial outliers may not be appropriate in some situations, and when the training set contains unknown outliers, the robustness of our method may be poor. These problems have not been considered in this paper and will be our future research directions.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.