A novel multi-label classification algorithm based on K-nearest neighbor and random walk

The multi-label classification problem occurs in many real-world tasks where an object is naturally associated with multiple labels, that is, concepts. The integration of the random walk approach into multi-label classification methods has attracted considerable research attention. A key challenge of random walk-based multi-label classification algorithms is constructing the random walk graph, which, if done poorly, leads to low classification quality and high algorithmic complexity. In this article, we propose a novel multi-label classification algorithm based on the random walk graph and the K-nearest neighbor algorithm (named MLRWKNN). This method constructs the vertex set of a random walk graph from the K-nearest neighbor training samples of a given test instance and the edge set from correlations among the labels of those training samples, thus considerably reducing the overhead in time and space. The proposed method improves similarity measurement by differentiating and integrating discrete and continuous features, which reflects the relationships between instances more accurately. A label prediction method is devised to reduce the subjectivity of the traditional threshold method. Experimental results with four metrics demonstrate that the proposed method outperforms seven state-of-the-art multi-label classification algorithms and achieves a significant improvement for multi-label classification.


Introduction
In the data mining field, traditional binary or multi-class classification problems have been explored substantially. However, the multi-label classification (MLC) problem remains open and has recently attracted increasing research attention due to its wide range of applications, such as text classification, 1,2 gene function classification, 3 social network analysis, 4 and image/video annotation. 5 Furthermore, with the rapid development and application of wireless sensor networks (WSNs), massive data collected from a large number of monitoring objects [6][7][8][9][10][11][12] are analyzed. The following examples illustrate advanced data analysis approaches applied to WSN data. With a wireless sensor network system set up in a room to collect limb motion data, Guraliuc et al. 9 used the KNN and SVM algorithms to classify limb movements, aiming to develop a method for patient motion therapy. Constructing a sensor device over a bed to collect data on human sleeping postures, Barsocchi 10 used KNN and SVM algorithms to classify the sleeping postures for bedsore therapy. Belmannoubi et al. 11 adopted the MLC method to simplify and reduce the complexity of the classification task in order to improve the accuracy of zone location in a multi-building, multi-floor indoor environment. Zhang et al. 12 applied the MLC method to detect multiple data faults 13 (regarded as multiple labels) simultaneously in sensor networks, because it is difficult to build a detection model for each fault type.
The traditional single-label classification (SLC) problem assumes that an instance belongs to exactly one category, whereas in the MLC problem an instance can be allocated to multiple categories simultaneously. Since SLC is merely a special case, MLC deals with a more difficult and general problem in the data mining domain, presenting the following two challenges. (1) The number of possible label sets may be very large for test instances (growing exponentially with the total number of labels). 14 For instance, 10 labels lead to $2^{10} = 1024$ possible label sets for each test instance. (2) Due to multiple labels and possible links among them, the label correlations become very complex. 15 For instance, on one hand, a piece of news tagged ''entertainment'' is more likely to carry the additional tag ''sport'' than ''war.'' On the other hand, in a classification of natural scenes with the picture label set {''beach,'' ''field,'' ''autumn leaf,'' ''sunset,'' ''mountain,'' ''city''}, it is unlikely that a scenery picture is labeled with both ''autumn leaf'' and ''beach.'' Thus, new approaches, such as graph representations, have been introduced into MLC algorithms in recent years to cope with the above problems. However, these methods and algorithms [15][16][17][18][19][20][21][22][23][24][25][26][27][28][29] still fall short, especially in the following aspects. (1) Complexity: in order to build a graph model, a graph-based MLC method must map the instances of an entire training set to graph vertices; the many instances irrelevant to the test instances impose very high time and space requirements and high computational complexity. (2) Heterogeneity: for similarity measurement among instances, nearly none of these methods consider the difference between discrete and continuous features. A measurement method for continuous features is not always applicable to discrete features with non-consecutive values (e.g. gender, occupation). Suppose that occupations are encoded by positive integers {1, 2, ..., n} and the corresponding values of instances $x_1$, $x_2$, and $x_3$ are 1, 2, and 5, respectively. Theoretically, the pairwise similarities should all be equal, yet a continuous measure such as the Euclidean distance 30 is often applied nonetheless, making the distance between $x_1$ and $x_3$ larger (hence $x_1$ appear less similar to $x_3$) than that between $x_1$ and $x_2$. (3) Uncertainty: when calculating the label sets of test instances, most of the above-mentioned methods use a probability threshold to determine the predicted labels. The subjectivity in selecting the probability threshold unavoidably leads to improper settings and overfitting of the label sets. 17,23 To overcome these problems, we propose a novel graph-based MLC algorithm that adopts the KNN and random walk algorithms, named multi-label classification based on the random walk graph and the K-nearest neighbor algorithm (MLRWKNN). Here, the random walk algorithm is used to explore the interdependent relationships among labels through the connectivity between vertices of a graph model.
In order to construct a random walk graph, the MLRWKNN algorithm creates a vertex set for a given test instance containing only its KNN training instances, rather than the entire training set, and an edge set based on the correlations among the label sets of the vertex samples. For the similarity measurement, the MLRWKNN algorithm differentiates and integrates the discrete and continuous features of the datasets and thereby improves the similarity computation. To deal with the subjectivity problem of label prediction, the MLRWKNN algorithm estimates the number of labels by computing the probability of the test instance belonging to each label and then ranks the labels in descending order of their predicted probabilities. Furthermore, considering the possible effects of parameter selection on classification performance, we also discuss the selection principles and recommended values for K, the jump probability α, and the adjustment factor σ. The main contributions of this article include the following:

1. A new construction method for the graph model is proposed to greatly reduce the time and space complexity of the random walk algorithm, and hence to handle large-scale data more easily than traditional methods.
2. The similarity measurement method is improved to express the relationships between instances more accurately by differentiating and integrating discrete and continuous features.
3. A new prediction method is devised for the label set to reduce the subjectivity of the probability threshold method used in most graph-based MLC algorithms.
4. We propose recommended values of the algorithm parameters through experimental analysis instead of subjective empirical constants, which is of great significance when applying this method to similar problems in different domains.
The rest of this article is organized as follows. The background of and related work on the random walk strategy, the KNN algorithm, and graph-based MLC algorithms are discussed in section ''Related work.'' Based on the previous research, we propose our approach MLRWKNN in section ''The principle of the MLRWKNN algorithm,'' which consists of three components: the design of the feature similarity computation, the construction of the random walk graph, and label set prediction. In section ''Experimental results and analysis,'' we describe the experimental process, the test data used, and the evaluation criteria, and present comprehensive experimental results and their analysis. We conclude the article in section ''Conclusion,'' indicating our contributions to this research area and our future work in this direction.

Related work
In this section, we first introduce the random walk strategy and KNN algorithm, which will build up a theoretical foundation for our proposed approach, and then we review the graph-based MLC algorithms.

Random walk and the KNN algorithms
Random walk is an algorithm based on graph representation that iteratively explores the global structure of a network to estimate the proximity between two nodes. As mentioned in the previous section, one challenge of the MLC problem is the complex relationships among multiple labels. One solution is to apply the random walk algorithm to describe the correlations among labels accurately via the connectivity between vertices of a random walk graph. The four inputs of a random walk algorithm are an adjacency matrix $P \in \mathbb{R}^{r \times r}$ of state transition probabilities (where $\mathbb{R}$ is the set of real numbers and r is a natural number), an initial probability distribution vector $\pi_0 \in \mathbb{R}^r$, a jump probability α, and a jump-occurrence probability distribution vector $u \in \mathbb{R}^r$. The iterative mode of the random walk can be described recursively as $\pi_{k+1} = (1 - \alpha)\, P\, \pi_k + \alpha u$ (k = 0, 1, 2, ...). Generally speaking, each element P(i, j) of P, $1 \le i, j \le r$, is given by the similarity between vertices i and j, so the walk probability from a vertex i to a neighbor vertex j is higher the more similar they are. The algorithm assumes that an initial jump lands on any vertex with the same probability, so every element of u has the same value (e.g. 1/r). In addition, the kth element $\pi_0(k)$ of the initial vector $\pi_0$ is given by the similarity between the start vertex (e.g. $x_0$) and vertex k. As described in Figure 1, the basic idea behind the random walk is that a walker traverses the graph from one vertex through a series of vertices; at any vertex, the walker moves to a neighbor vertex with probability 1 − α and teleports to any other vertex of the graph with probability α. 17 A probability distribution vector $\pi_k$ is obtained after the kth walk, and the random walk algorithm uses $\pi_k$ as the input of the (k + 1)th walk, iterating over the graph until it reaches the maximum iteration number or $\pi$ converges. 17

The KNN algorithm is a lazy learning method that classifies samples according to the idea of ''birds of a feather flock together.'' For a given test instance, it retrieves the K nearest training instances according to a certain similarity measure and votes on their labels to determine the predicted label of the test instance. As shown in Figure 2, the test instance ''d'' would be classified as ''D'' when K = 5 according to the voting mechanism, but as ''h'' when K = 10. Obviously, the key problems of the KNN algorithm are how to choose the similarity measurement and the value of K.
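The iteration above is straightforward to implement. The following minimal sketch assumes P is stochastic (so that the matrix-vector product redistributes probability mass along edges) and iterates until the distribution stops changing; the function and parameter names are illustrative:

```python
import numpy as np

def random_walk(P, pi0, u, alpha, max_iter=100, tol=1e-8):
    """Iterate pi_{k+1} = (1 - alpha) * P @ pi_k + alpha * u until the
    distribution converges or the iteration budget is exhausted."""
    pi = pi0.copy()
    for _ in range(max_iter):
        pi_next = (1 - alpha) * (P @ pi) + alpha * u
        if np.abs(pi_next - pi).sum() < tol:   # L1 change below tolerance
            return pi_next
        pi = pi_next
    return pi
```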

PTMs and AAMs
In general, MLC algorithms can be classified into two categories: problem transformation methods (PTMs) and algorithm adaptation methods (AAMs). The classical PTMs transform an MLC problem into one or several SLC problems for which existing solutions are available. Here, we briefly describe a few typical ones. Binary relevance (BR) 4 transforms the MLC problem into binary classifications, generating a separate dataset for each label. Calibrated label ranking (CLR) 31 treats MLC as a label ranking problem and learns a mapping from instances to rankings over a predefined set of labels. Hierarchy of multilabel classifiers (HOMER) 32 transforms the MLC problem into a tree hierarchy of simpler MLC tasks, in which the labels at each node are split using a balanced clustering algorithm, similar labels are grouped into a meta-label, and the meta-labels are predicted by a classifier. Label powerset (LP) 33 builds a single-label system by treating each distinct combination of labels occurring in the dataset as an independent label. RAndom k-labELsets (RAkEL) 34 converts the MLC problem into multiple classification problems, ranking the votes of sub-classifiers and taking the most relevant labels, selected by a threshold, as the prediction results.
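As a concrete illustration of the two most basic transformations above, the following minimal sketch (toy data, not from the paper) derives the BR and LP training targets from a small binary label matrix:

```python
import numpy as np

# Toy multi-label data: 4 instances, 3 labels (1 = label present).
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

# Binary relevance: one independent binary target vector per label,
# i.e. three separate single-label (binary) training problems.
br_targets = [Y[:, q] for q in range(Y.shape[1])]

# Label powerset: every distinct label combination becomes one class.
combos, lp_targets = np.unique(Y, axis=0, return_inverse=True)
print(len(combos))   # 3 distinct combinations -> 3 single-label classes
print(lp_targets)    # [1 0 1 2]: the powerset class index per instance
```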
The traditional AAMs intend to enhance the SLC algorithms so that they can handle the MLC problem. For example, the well-known multi-label K-nearest neighbor (MLKNN) 35 extends the KNN algorithm using the maximum a posteriori (MAP) principle to determine the label set for the unseen instances. Using the maximum margin strategy to deal with multi-label data, the classic Rank-SVM 36 optimizes a set of linear classifiers to minimize the empirical ranking loss and can hence handle nonlinear cases with kernel tricks.
Recently, researchers have been focusing on introducing graph representations into MLC algorithms, which can produce more accurate results through probability-based principles and an elegant capability of detecting label correlations. The graph-based MLC methods can be put into two categories: one category improves existing MLC algorithms by building corresponding graph models for multi-label datasets, and the other solves the MLC problem by combining SLC algorithms with a graph model. The first category includes the improved BR algorithms, 15 the improved classifier chain (CC) algorithms, 27,28 and the improved MLKNN algorithm. 29 In the following, we describe each of them succinctly. In Cetiner and Akgul, 15 the label independence issue of the BR algorithm is addressed by first treating the outputs of each binary classifier as observed nodes of a graphical model and then determining the final label assignments using standard Bayesian inference over the unobservable nodes. The Neighbor Pair Correlation Chain Classifier (NPC) algorithm 27 constructs a graph of labels based on the latent Dirichlet allocation (LDA) model and acquires the label correlations using the random walk with restart strategy; the CC algorithm then solves the label chain selection problem by using these label correlations to establish the best label chain. In Lee et al., 28 a directed acyclic graph (DAG) is constructed to maximize the sum of conditional entropies between all parent and child nodes, the highly correlated labels are sequentially ordered in chains obtained from the DAG, and the predictive power is maximized by applying a CC approach to these chains. The instance-ranking method (IR) 29 maps all training and test instances into a graph, assigns weights to each training instance via random walk, and uses these weights to calculate the label prior probabilities in the MLKNN algorithm.
The following algorithms belong to the second category: the random walk-based MLC algorithms, 16,17,19,20,22,23 the MLC algorithm based on the Hilbert-Schmidt independence criterion, 24 and the dictionary learning-based DL-MLC algorithm. 25 Specifically, the graph DL-MLC algorithm 25 maps a training set to a graph model and improves the label-consistent K-SVD algorithm (LC-KSVD) 37 by adopting graph Laplacian regularization. The genome-scale metabolic model (GSMM) method 24 contains three steps: first, it converts the training and test sets into a weighted undirected graph and describes the graph smoothness through a series of transformations involving the adjacency matrix, the real label sets of the training instances, and the predicted label sets of the test instances; second, it describes the consistency of the label space with the Hilbert-Schmidt independence criterion; third, it builds the MLC classifier by jointly optimizing smoothness and consistency. Among the random walk-based MLC algorithms, MLRW, 17 ML-RWR, 23 and RW.KNN 19 map all training instances to graph vertex sets, acquire the probability distribution from test instances to the training set with a random walk, and compute the predicted probability of each label from that probability distribution and the label sets of the training instances. Although these three algorithms all construct their edge sets over the whole training set, MLRW connects edges between instances that share labels, ML-RWR connects edges between instances with mutual KNN relations, and RW.KNN considers only unilateral KNN relations. The transductive multi-label learning (TML) 16 method builds a complex partially directed graph by mapping all training and test instances to graph vertices: undirected edges link the test instances that have KNN relations and directed edges link the training instances. Based on this graph, TML improves the method proposed in Azran 38 to tackle the MLC problem. Wang et al. 20 constructed a bi-relational graph by combining a data graph and a label graph. The data graph is built from all training and test instances, and the random walk with restart strategy is used to compute the transition probability from each label vertex to an instance vertex, that is, the predicted label probability.
In summary, regarding graph construction, all the aforementioned algorithms use the whole training set (or even all training and test instances) to construct the graph vertices, which embeds many vertices unrelated to the test instances. Such huge vertex sets require considerable computation in time and space, and sometimes even deteriorate the classification performance. 17 Regarding similarity measurement, the above algorithms adopt the reciprocal of the distance between feature vectors, [17][18][19]29 the distance-based Gaussian function method, 16,20,23,24 or the sparse rule operator method; 22 none of them considers the difference between discrete and continuous features. Regarding parameter selection, almost all of the algorithms adopt empirical constants and provide no in-depth experimental analysis of how to choose the parameters. In order to overcome these shortcomings, we propose a novel MLC algorithm based on the random walk strategy and the KNN algorithm. Section ''The principle of the MLRWKNN algorithm'' provides a detailed description of the proposed method.

The principle of the MLRWKNN algorithm
In order to outline the MLRWKNN algorithm proposed in this article, some basic concepts of MLC are discussed here. Suppose that (x, y) represents a multi-label sample, where x is an instance and $y \subseteq L$ is its corresponding label set. The total label set L is defined as

$L = \{l_1, l_2, \ldots, l_Q\}$, where Q is the total number of labels. (1)

Suppose that $\mathbf{x} = (x^1, x^2, \ldots, x^D) \in \mathcal{X}$ is the D-dimensional feature vector corresponding to x, where $\mathcal{X} \subseteq \mathbb{R}^D$ is the feature vector space and $x^d$, d = 1, 2, ..., D, denotes a specific feature, and $\mathbf{y} = (y_1, y_2, \ldots, y_Q) \in \{0, 1\}^Q$ is the Q-dimensional label vector corresponding to y, where

$y_q = \begin{cases} 1, & l_q \in y \\ 0, & \text{otherwise.} \end{cases}$ (2)

Therefore, the multi-label classifier h can be defined as

$h: \mathcal{X} \rightarrow \{0, 1\}^Q$. (3)

Suppose there are m samples in the training set $X_{train}$ and n samples in the test set $X_{test}$:

$X_{train} = \{(x_i, y_i) \mid 1 \le i \le m\}, \quad X_{test} = \{x_j \mid 1 \le j \le n\}$. (4)

Let $x_0 \in X_{test}$ denote a certain test instance.

Overall description of the MLRWKNN algorithm

The MLRWKNN algorithm aims to solve the MLC problem by constructing random walk graphs whose vertex and edge sets are generated from a certain test instance (e.g. $x_0$) and its KNN instances (denoted $N_K^{x_0}$) in $X_{train}$. In detail, the MLRWKNN algorithm consists of the following three steps:

1. For each test instance in $X_{test}$, the MLRWKNN algorithm constructs a random walk graph for each label. For example, the graph of $x_0$ on $l_q$ is constructed as follows: the algorithm maps $x_0$ and its KNN instances (K = 4 in Figure 3) in $X_{train}$ to a vertex set; the KNN instances are connected by undirected edges if they share a label, and $x_0$ is connected by undirected edges with the instances that carry the label $l_q$. Finally, Q graphs (denoted $G_{x_0}^q$, q = 1, 2, ..., Q) are constructed for $x_0$, as described in Figure 3. The details of the graph construction are given in section ''A new construction method of the random walk graph'' (equations (8)-(12)).
2. Through the random walk operation, the MLRWKNN algorithm computes the probability distribution over the graph vertices. As shown in Figure 3, starting from $x_0$ on each of the above graphs, a walker moves to a neighbor vertex with probability 1 − α and teleports to any other vertex with probability α, whereby Q stable probability distribution vectors are obtained. Computing the prior probability of each label as the weight of the corresponding vector, the MLRWKNN algorithm produces a weighted sum of the Q vectors, which is taken as the final probability distribution vector. Equations (13)-(18) in section ''A new construction method of the random walk graph'' describe the random walk procedure and the generation of the probability distribution vector.
3. Using the obtained probability distribution vector, the MLRWKNN algorithm predicts the probability of each label belonging to $x_0$, sorts these probabilities in descending order, calculates the number of labels of $x_0$, and finally generates the predicted labels by choosing the highest-probability labels. Section ''Label sets prediction'' gives detailed information about the prediction of the label set.
In order to elaborate the MLRWKNN algorithm in greater detail, we discuss the design of the similarity measurement, the construction of the random walk graph, and the prediction of the label set in the following sections. Similarity measurement, as the basis of the KNN algorithm, is discussed in section ''Design of proposed similarity,'' and sections ''A new construction method of the random walk graph'' and ''Label sets prediction'' give detailed descriptions of the above three steps.

Design of proposed similarity
Considering the difference between discrete and continuous features, MLRWKNN calculates the similarities for discrete and continuous features separately and then combines them through a linear weighting. Specifically, given two instances $x_i$ and $x_j$, their similarity based on discrete features is defined as

$sim_d(x_i, x_j) = \frac{1}{|F_d|}\sum_{d \in F_d} \delta(x_i^d, x_j^d)$, (5)

where $F_d$ is the set of discrete features and $\delta(a, b)$ equals 1 if a = b and 0 otherwise. In order to bound the value range of the similarity based on continuous features, MLRWKNN adopts the Gaussian kernel function

$sim_c(x_i, x_j) = \exp\!\left(-\frac{dist(x_i, x_j)^2}{2\sigma^2}\right)$, (6)

where $dist(x_i, x_j)$ represents the distance over the continuous features (such as the Euclidean distance) and σ is the spread factor of the Gaussian kernel. 39 Refer to section ''Experimental analysis of the selection of the σ value'' for a detailed definition and explanation of σ. The final similarity between $x_i$ and $x_j$ is defined as

$sim(x_i, x_j) = w \cdot sim_d(x_i, x_j) + (1 - w) \cdot sim_c(x_i, x_j)$, (7)

where $w \in [0, 1]$ weights the discrete part against the continuous part.
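The following minimal sketch implements the mixed similarity of equations (5)-(7); the feature index sets and the default weight `w` are assumptions about how the two parts are supplied and combined:

```python
import numpy as np

def similarity(xi, xj, disc_idx, cont_idx, sigma, w=0.5):
    """Mixed similarity sketch: indicator match ratio over discrete
    features (eq. 5), Gaussian kernel of the Euclidean distance over
    continuous features (eq. 6), linearly combined with weight w (eq. 7)."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    # Discrete part: fraction of discrete features with identical values.
    sim_d = float(np.mean(xi[disc_idx] == xj[disc_idx])) if disc_idx else 0.0
    # Continuous part: the Gaussian kernel maps any distance into (0, 1].
    if cont_idx:
        d = np.linalg.norm(xi[cont_idx] - xj[cont_idx])
        sim_c = float(np.exp(-d**2 / (2 * sigma**2)))
    else:
        sim_c = 0.0
    return w * sim_d + (1 - w) * sim_c
```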
A new construction method of the random walk graph

In general, only a few training instances play a decisive role in predicting the label sets of test instances, whereas the other training instances complicate the random walk graph and interfere with the classification results. In this article, we propose a novel construction method for the random walk graph by adopting the KNN instances of $x_0$ in $X_{train}$, where the KNN instance set $N_K^{x_0}$ is defined as

$N_K^{x_0} = \{x \mid x \text{ belongs to the } K \text{ nearest neighbors of } x_0 \text{ in } X_{train}\}$. (8)

Figure 3. The construction of random walk graphs.

Let $G_{x_0}^q$ represent the KNN graph of $x_0$ based on $l_q$, defined as

$G_{x_0}^q = (V_{x_0}, E_{x_0} \cup E_0^q)$, (9)

where:

1. The vertex set of instances $V_{x_0}$ is defined as $V_{x_0} = \{v_0\} \cup \{v_i \mid x_i \in N_K^{x_0}\}$; (10)
2. $v_0$ represents the vertex of $x_0$;
3. The edge set $E_0^q = \{e_{0i} \mid x_i \in N_K^{x_0},\ l_q \in y_i\}$ (11) links $v_0$ with the vertices in $V_{x_0}$ whose instances carry the label $l_q$;
4. $E_{x_0}$, the edge set among the vertices of $V_{x_0}$, is defined as $E_{x_0} = \{e_{ij} \mid x_i, x_j \in N_K^{x_0},\ y_i \cap y_j \ne \emptyset\}$. (12)

Since there is no need to construct the KNN for every training instance in $X_{train}$, MLRWKNN avoids many sort operations (the KNN of each training instance would require sorting m training instances). Based on $G_{x_0}^q$, the iterative mode of the random walk is described as

$\pi_q^{k+1} = (1 - \alpha)\, P_{x_0}^q\, \pi_q^k + \alpha u \quad (k = 0, 1, 2, \ldots)$. (13)

The terms in this formula are explained as follows: (1) $\pi_q^k \in \mathbb{R}^{K+1}$ is the probability distribution vector after k random walk steps, whose ith element $\pi_q^k(i)$ represents the probability of the (i−1)th vertex of $G_{x_0}^q$; and (2) $P_{x_0}^q \in \mathbb{R}^{(K+1) \times (K+1)}$ is the adjacency matrix of the state transition probabilities of $G_{x_0}^q$, whose element $P_{x_0}^q(i, j)$, the probability of a random walk from $v_i$ to $v_j$, is defined as

$P_{x_0}^q(i, j) = \begin{cases} \dfrac{sim(x_i, x_j)}{\sum_{e_{it} \in G_{x_0}^q} sim(x_i, x_t)}, & e_{ij} \in G_{x_0}^q \\ 0, & \text{otherwise.} \end{cases}$ (14)

In equation (13), $\alpha \in [0, 1]$ is a constant and $u \in \mathbb{R}^{K+1}$ is the jump probability vector. Assuming that a walk jumps from a vertex to the other vertices with the same probability 1/(K + 1) when starting from any arbitrary vertex, we define u as

$u = \frac{1}{K + 1} I_{K+1}$, (15)

where $I_{K+1}$ is the (K + 1)-dimensional constant vector whose elements all equal 1. Note that for equation (13), the convergent probability distribution vector $\pi_q^*$ must satisfy

$\pi_q^* = (1 - \alpha)\, P_{x_0}^q\, \pi_q^* + \alpha u$. (16)

When the random walk procedure ends, the MLRWKNN algorithm has generated Q stable probability distribution vectors, and the final probability distribution vector is defined as

$\pi = \sum_{q=1}^{Q} \pi_L(q) * \pi_q^*$, (17)

where $\pi_L \in \mathbb{R}^Q$, whose qth element $\pi_L(q)$ represents the prior probability of the qth label, is defined as

$\pi_L(q) = \frac{\sum_{i=1}^{m} \delta(l_q \in y_i)}{\sum_{p=1}^{Q} \sum_{i=1}^{m} \delta(l_p \in y_i)}$. (18)

Note that the symbol (*) in equation (17) is an ordinary multiplication operator. In summary, for a given random walk graph, the MLRWKNN algorithm obtains the probability distribution from $v_i$ to the other vertices by one random walk operation, and then uses these probabilities as the starting probabilities for each vertex. The algorithm repeats this process to obtain new probabilities until it reaches a given number of walk rounds or the probability distribution remains unchanged. The last probabilities are taken as the final walk probabilities from $v_i$ to the other vertices.
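A compact sketch of equations (8)-(16) for one label follows. The edge rules mirror Figure 3; weighting the edges by similarity and row-normalizing into a stochastic matrix are assumptions about the exact form of equation (14):

```python
import numpy as np

def walk_one_label(y_knn, sim_x0, sim_nn, q, alpha=0.15,
                   max_iter=100, tol=1e-8):
    """Sketch of one per-label walk of MLRWKNN (equations (8)-(16)).

    y_knn  : (K, Q) binary label matrix of the K nearest neighbors of x0.
    sim_x0 : length-K similarities between x0 and each neighbor.
    sim_nn : (K, K) pairwise similarities among the neighbors.
    Vertex 0 is x0; vertices 1..K are the neighbors.
    """
    K = y_knn.shape[0]
    W = np.zeros((K + 1, K + 1))
    for i in range(K):
        if y_knn[i, q]:                       # e_0i: neighbor carries l_q
            W[0, i + 1] = W[i + 1, 0] = sim_x0[i]
        for j in range(i + 1, K):             # e_ij: neighbors share a label
            if np.any(y_knn[i] * y_knn[j]):
                W[i + 1, j + 1] = W[j + 1, i + 1] = sim_nn[i, j]
    # Row-normalize into a stochastic matrix (assumed form of eq. (14));
    # rows without edges stay zero -- the restart term keeps them reachable.
    row = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row, out=np.zeros_like(W), where=row > 0)
    u = np.full(K + 1, 1.0 / (K + 1))         # uniform jump vector, eq. (15)
    pi = u.copy()
    for _ in range(max_iter):                 # iterate eq. (13)
        nxt = (1 - alpha) * (P.T @ pi) + alpha * u
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi                                 # approximates pi*_q of eq. (16)
```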

Label sets prediction
The MLRWKNN algorithm predicts the cardinality $len(y'_0)$ of the label set of a test instance $x_0$ from the probabilities that $x_0$ belongs to each label, sorts these probabilities in descending order, and selects the first $len(y'_0)$ labels to form the predicted label set of $x_0$. Suppose $p_{x_0} \in \mathbb{R}^Q$ is the predicted probability vector of $x_0$ over the labels. Its qth element $p_{x_0}(q)$, representing the probability that $x_0$ belongs to $l_q$, is defined as

$p_{x_0}(q) = \sum_{k=1}^{K} \pi(k)\, \delta(l_q \in y'_k)$, (19)

where $y'_k$ is the label set of the training instance corresponding to $v_k$, and $\delta(l_q \in y'_k)$ equals 1 if $l_q \in y'_k$ and 0 otherwise. The predicted label set is

$y'_0 = \{l_q \mid rank_{p_{x_0}}(q) \le len(y'_0)\}$, (20)

where $rank_{p_{x_0}}(q)$ is the rank of $p_{x_0}(q)$ in the descending order of $p_{x_0}$, and $len(y'_0)$ is defined as

$len(y'_0) = \left\lfloor \sum_{q=1}^{Q} p_{x_0}(q) \right\rfloor$, (21)

where $\lfloor r \rfloor$ represents the largest integer that is not greater than the real number r.
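A sketch of the prediction step follows; reading $len(y'_0)$ as the floor of the expected label count is an interpretation of equation (21), not a verbatim transcription:

```python
import numpy as np

def predict_label_set(pi, y_knn):
    """Sketch of equations (19)-(21): score each label by the stationary
    mass on neighbors that carry it, estimate the label-set size as the
    floor of the expected label count, and keep the top-ranked labels."""
    K, Q = y_knn.shape
    mass = pi[1:]                                  # mass on the K neighbors
    p = np.array([mass[y_knn[:, q] == 1].sum() for q in range(Q)])
    length = max(int(np.floor(p.sum())), 1)        # len(y'_0), eq. (21)
    ranked = np.argsort(-p)                        # descending probability
    return set(ranked[:length].tolist()), p
```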
In summary, the MLRWKNN algorithm first computes $N_K^{x_0}$ according to the proposed similarity measure and constructs $G_{x_0}^q$ for the instance $x_0$. Then, random walks are executed on $G_{x_0}^q$ until a convergent probability distribution is obtained. Finally, the label set is predicted by calculating its length. The MLRWKNN algorithm is shown in Algorithm 1.
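Putting the pieces together, a hypothetical end-to-end driver corresponding to Algorithm 1 might look as follows; it reuses the `similarity`, `walk_one_label`, and `predict_label_set` sketches from the previous sections (here `sim` is any pairwise similarity function, e.g. equation (7) with its index sets and σ bound in):

```python
import numpy as np

def mlrwknn_predict(x0, X_train, Y_train, K, sim, alpha=0.15):
    """Hypothetical driver for Algorithm 1 (a sketch, not the paper's code)."""
    # Step 1: the K nearest neighbors of x0 under the proposed similarity.
    sims = np.array([sim(x0, x) for x in X_train])
    nn = np.argsort(-sims)[:K]
    y_knn, sim_x0 = Y_train[nn], sims[nn]
    sim_nn = np.array([[sim(X_train[i], X_train[j]) for j in nn] for i in nn])
    # Step 2: one restart walk per label, combined with the label priors
    # of equation (18) as weights (equation (17)).
    counts = Y_train.sum(axis=0)
    prior = counts / counts.sum()
    pi = sum(prior[q] * walk_one_label(y_knn, sim_x0, sim_nn, q, alpha)
             for q in range(Y_train.shape[1]))
    # Step 3: rank the labels and cut at the estimated cardinality.
    labels, _ = predict_label_set(pi, y_knn)
    return labels
```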

Convergence proof of MLRWKNN algorithm
We rely on the fact that a simple random walk on a graph is a discrete-time Markov chain over the nodes (i.e. vertices). 40

Theorem 1. The MLRWKNN algorithm is convergent.

Proof:

1. Because the vector u contains no zero elements and 0 < α < 1, the walk can move from any vertex to any other vertex in $G_{x_0}^q$. Therefore, the adjacency matrix $P_{x_0}^q$ is irreducible.
2. A walk from any vertex can in principle return to any given vertex, including the vertex itself, in consecutive steps owing to the strictly positive vector u. Therefore, the whole random walk procedure is aperiodic.
3. Since the walk is a Markov chain over a finite vertex set, a vertex can be traversed again within a positively recurrent number of walk steps after it is traversed for the first time; that is, the chain is positive recurrent.
4. From the above three points, the MLRWKNN algorithm is ergodic, and hence convergent. 17 That is, there exists a vector $\pi_q^*$ that satisfies equation (16).
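Because the fixed-point condition in equation (16) is linear, the stationary distribution also admits a closed form; the following derivation is standard for random walks with restart (it is not spelled out in the paper):

```latex
\pi_q^{*} = (1-\alpha)P_{x_0}^{q}\,\pi_q^{*} + \alpha u
\;\Longrightarrow\;
\bigl(I - (1-\alpha)P_{x_0}^{q}\bigr)\pi_q^{*} = \alpha u
\;\Longrightarrow\;
\pi_q^{*} = \alpha\,\bigl(I - (1-\alpha)P_{x_0}^{q}\bigr)^{-1}u
```

The inverse exists because the spectral radius of $(1-\alpha)P_{x_0}^{q}$ is at most $1 - \alpha < 1$ for a stochastic $P_{x_0}^{q}$, which is consistent with the ergodicity argument above.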

Experimental results and analysis
In this section, we first introduce the experimental environment and datasets; we then discuss the experimental analysis of the construction of the random walk graph, the similarity measurement method, and the parameter selection of the MLRWKNN algorithm; and we finally make a comprehensive experimental comparison with seven state-of-the-art algorithms.

Experiment environment and datasets
Six datasets, which are extensively used by researchers to conduct experiments and evaluate MLC algorithms, are adopted: Flags, Genbase, Medical, Scene, Yeast, and Mediamill (for detailed information about these public datasets, refer to http://mulan.sourceforge.net/datasets-mlc.html). They cover different application domains such as text, images, biological data, and video, and their numbers of labels vary from 6 to 101, as described in Table 1. The averages over repeated runs were taken as the final values of every evaluation criterion.

Evaluation criteria
So far, a series of criteria have been introduced to evaluate MLC algorithms from different perspectives. These criteria can be divided into two categories: bipartition-based criteria and ranking-based criteria. 14,41 The former concentrate on whether a label is correctly predicted, while the latter on whether a relevant label is ranked before an irrelevant one. In general, no algorithm produces the best performance for all criteria; appropriate criteria should be chosen in accordance with the optimization objective of a proposed method. As our MLRWKNN algorithm focuses on the ranking problem, four ranking-based criteria are used to validate our method: Ranking Loss (RL), One Error (OE), Coverage (Cove), and Average Precision (AP). Let f denote the predicted probability function and $y'_0$ represent the predicted label set of a test instance $x_0$. The predicted probabilities of $x_0$ over the labels are sorted in descending order, and $rank_f(x_i, l)$ represents the corresponding rank of the label l. Let $\bar{y}_i$ be the complement of $y_i$ in L. RL computes the average fraction of cases in which irrelevant labels are ranked before relevant labels. OE calculates the average number of times that the top-ranked label is irrelevant to the test instance. Cove reckons the average number of steps needed to move down the ranked list to cover all relevant labels of the test instance. AP evaluates the degree to which the labels ranked before a relevant label are themselves relevant.
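For reference, the four criteria in their standard forms from the multi-label learning literature, over n test instances, with [·] equal to 1 if its argument holds and 0 otherwise (the paper's exact notation may differ slightly):

```latex
\mathrm{RL} = \frac{1}{n}\sum_{i=1}^{n}
  \frac{\bigl|\{(l,l') \in y_i \times \bar{y}_i : f(x_i,l) \le f(x_i,l')\}\bigr|}
       {|y_i|\,|\bar{y}_i|}
\qquad
\mathrm{OE} = \frac{1}{n}\sum_{i=1}^{n}
  \Bigl[\arg\max\nolimits_{l \in L} f(x_i,l) \notin y_i\Bigr]

\mathrm{Cove} = \frac{1}{n}\sum_{i=1}^{n}\max_{l \in y_i}\mathrm{rank}_f(x_i,l) - 1
\qquad
\mathrm{AP} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|y_i|}\sum_{l \in y_i}
  \frac{\bigl|\{l' \in y_i : \mathrm{rank}_f(x_i,l') \le \mathrm{rank}_f(x_i,l)\}\bigr|}
       {\mathrm{rank}_f(x_i,l)}
```

Lower values are better for RL, OE, and Cove; higher values are better for AP.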
Results and analysis

Analysis on the proposed similarity measurement. Most graph-based MLC algorithms use similarity measurements designed for continuous features (e.g. the Euclidean distance) to compute instance distances, which may be unsuitable for discrete features. Based on the MLRWKNN algorithm, the proposed similarity is compared with the Gaussian kernel of the Euclidean distance 16,20,23,24 and the reciprocal of the Euclidean distance. [17][18][19]29 The Flags and Genbase datasets were chosen for this comparison: as described in Table 1, the Genbase dataset has only discrete features, whereas the Flags dataset has both discrete and continuous features. The results in Table 3 show that the proposed similarity reflects the distances between instances more accurately and greatly improves the classification performance.

Analysis of the proposed random walk graph. In order to illustrate the advantages of the MLRWKNN algorithm, we choose the ML-RWR 23 algorithm, which improves the MLRW 17 algorithm and adopts the KNN idea. We use the same datasets as ML-RWR, that is, the Medical and Yeast datasets, for the performance comparison; the experimental results are described in Table 4. As shown in Table 4, MLRWKNN produced almost the same classification performance as the ML-RWR algorithm. However, in contrast to the proposed MLRWKNN algorithm, ML-RWR constructs random walk graphs on the whole training set, which leads to large space and time requirements. Assume the training set of the ML-RWR algorithm is $X_{train}$. There are m + 1 vertices in its random walk graph, and the edge set construction needs m + 1 sorting operations, whereas in the MLRWKNN algorithm the corresponding vertex number is K + 1 (K ≪ m) and only one sorting operation on $X_{train}$ is executed for each test instance; see section ''A new construction method of the random walk graph.''

Analysis of parameter selection. There are three parameters that we need to select for the MLRWKNN algorithm: K (see equation (8)), the adjustment factor σ (see equation (6)), and the jump probability α (see equation (13)). Selecting suitable values for these parameters is crucial, since values that are either too large or too small cause problems for the algorithm. Too small a value of K leads to over-fitting; on the contrary, too large a K dilutes the influence of the instances related to $x_0$ on label prediction, which not only increases the computational complexity in time and space but also causes erroneous classification results. 42 For the parameter σ, too small a value leads to over-fitting and too large a value makes instance classification fail. 39 For the parameter α, too small a value makes the label prediction over-sensitive to changes in the state transition matrix, while too large a value slows the convergence of the random walk. 19 In order to determine accurate values for the above parameters, in the following we use the Scene dataset as sample data to illustrate the impact of parameter determination on the classification performance of the MLRWKNN algorithm. The optimal settings are also provided.
Experimental analysis of the selection of the parameter K. Different from adopting fixed constant values, we dynamically determine an optimal K value for each dataset during model training. Specifically, a gradual refinement method is adopted to determine the K value. First, the candidate K values divide the range [1, m] into n parts (for instance, n = 100), with $K = 1 + \frac{m-1}{n-1}(KNum - 1)$, where KNum represents the division index and KNum = 1, 2, ..., n. Then, an approximate interval that contains the best K value is chosen. By repeating the above process, the best K value is confirmed.
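A minimal sketch of the candidate grid this refinement scans (function and variable names are illustrative):

```python
def k_candidates(m, n=100):
    """n evenly spaced candidate K values between 1 and m (the
    training-set size): K = 1 + (m - 1)/(n - 1) * (KNum - 1)."""
    return [round(1 + (m - 1) / (n - 1) * (knum - 1))
            for knum in range(1, n + 1)]

# Example: m = 1000, n = 100 gives roughly 1, 11, 21, ..., 1000; the
# interval around the best-scoring K is then subdivided and searched again.
```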
From Figure 4, for the Scene dataset, it can be observed that a gradual increase of the K value improves the experimental results in terms of the four evaluation criteria, that is, RL, OE, Coverage, and AP. If K continues to increase beyond the optimal value, the evaluation results become worse or remain stable, which indicates that an appropriate K value can be reached for the best algorithm performance. Note that the optimal K values differ across measurement metrics; we select an optimal K value by synthesizing all the measurement metrics during model training.
Experimental analysis of the selection of the σ value. For convenience, we determine an optimal σ in terms of a parameter u in the interval [0, 1], with $\sigma = d_{min} + u\,(d_{max} - d_{min})$, where $d_{min}$ and $d_{max}$ denote the minimum and maximum distances between two training instances. Suppose that we take n groups of data in [0, 1]; then u = (uNum − 1)/(n − 1), where uNum represents the division index, taking a value from 1, 2, ..., n. Figure 5 shows the classification performance of the MLRWKNN algorithm with different values of σ on the Scene dataset. As shown in Figure 5, the classification performance gradually decreases as u increases and, after reaching its lowest value, gradually improves as u continues to increase. This indicates that, in general, an optimal σ value is found close to the minimum or maximum distance between two training instances. The experiments on the other five datasets show the same results.
Experimental analysis of the selection of the α value. We assign n (e.g. n = 101) different values to $\alpha \in [0, 1]$ in order to determine the most suitable one, letting α = (AlphaNum − 1)/(n − 1), where AlphaNum = 1, 2, ..., n represents the division index.
The classification performance of MLRWKNN with different α values on the Scene dataset is shown in Figure 6. From Figure 6, it can be observed that different α values have little effect on the classification results, and the experiments on the other five datasets produced the same results. However, the curves of each subgraph jump at both ends: when α = 0 the random walk only transfers between vertices along edges, whereas when α = 1 it only jumps randomly between vertices.
Briefly, from the above analysis we can observe that an optimal K value is dynamically determined during model training, an optimal σ value is generally found near the minimum or maximum distance between two training instances, and the value of α has little influence on the classification results, so in our experiments we set α to 0.15, as in Wang et al. 18

Comparison of experimental results of seven algorithms. We select seven highly cited MLC algorithms, that is, BRKNN, CLR, HOMER, LP, RAkEL, MLKNN, and Rank-SVM, for a comparison study against the proposed MLRWKNN algorithm. As the MLRW algorithm adopts the continuous-feature-based similarity measurement and uses m for the parameter K, the discussions in sections ''Analysis on the proposed similarity measurement'' through ''Analysis of parameter selection'' have indicated that it is only a special case of MLRWKNN under a non-optimal graph construction mode, a non-optimal similarity measurement, and non-optimal parameter selection, so there is no need to include MLRW in this comparison study.
The experimental datasets are described in Table 1, and the optimal values of the parameter K on the different datasets are given in Table 5. The parameter σ adopts the maximum distance between training instances, and α is set to 0.15. The experimental results are shown in Tables 6-11, with one dataset per table (the numbers in parentheses represent the rankings of the eight algorithms under the corresponding criteria).
In summary, as observed from Tables 6-11, the LP algorithm demonstrates the poorest performance on all six datasets among the eight algorithms, and the other baseline algorithms, including HOMER, show low performance on some datasets.

Conclusion
The MLC problem is an important and widely influential research problem in the field of data mining, impacting a wide range of real-world applications. Graph-based MLC algorithms have received increasing research attention in recent years, and introducing random walk-based methods into solutions to the MLC problem has become a hot research topic. In this article, we propose a novel approach, the MLRWKNN algorithm, which tackles this problem by integrating random walk methods into graph-based MLC. Our major contributions to the field of MLC include the following three aspects. First, a new paradigm is proposed for the graph model that constructs the vertex set from the KNN training instances of a given test instance and builds the edge set from the label correlations of the vertex samples; this paradigm reduces the overhead in time and space compared with other graph-based models. Second, the similarity computation is performed more accurately by differentiating and integrating discrete and continuous features, and a novel label set prediction method overcomes the subjectivity of threshold determination in traditional methods. Third, we consider the influence of parameter selection on classification performance and suggest selection principles and recommended values for the algorithm parameters. Extensive experiments and comparisons have been conducted on the six datasets between the seven baseline algorithms and ours, in order to evaluate the proposed similarity measurement, the new graph construction method, and the MLRWKNN algorithm. The experimental results demonstrate that the proposed MLRWKNN algorithm produces much better results than the seven state-of-the-art MLC algorithms.
Future work: several efforts are required to improve and extend the proposed method. First, in this work we considered only a linear integration of the similarity metrics for continuous and discrete features. To explore the theoretical similarity computation further, it will be interesting and useful to investigate a nonlinear combination style for the proposed similarity measurement that takes the continuity and discreteness of dataset features into account. Second, it is a challenge to devise an adaptive adjustment mechanism for determining the algorithm parameters, in order to enhance the accuracy and automation of MLC methods and tools. Third, deep learning approaches 43 have recently been widely explored and are considered to have a strong impact on various application domains. Integrating deep learning or rule learning methods into MLC methods 44,45 is attracting researchers, and some interesting results have been observed; we believe this is another significant direction for our future work. Fourth, a newly published paper proposed a semi-supervised learning (SSL) method based on random walk, 40 which leverages the notion of ''landing probabilities'' of class-specific random walks and aims at a great improvement in computational complexity and scalability for large graphs. This highlights another line of investigation of MLC problems. As a next step for MLRWKNN, we plan to contrast this method with our KNN- and random walk-based method and conduct a comparative study of algorithm performance through experiments on the datasets used in Berberidis et al. 40

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by National Science