A Distributed Weighted Possibilistic c-Means Algorithm for Clustering Incomplete Big Sensor Data

Possibilistic c-means clustering algorithm (PCM) has emerged as an important technique for pattern recognition and data analysis. Owning to the existence of many missing values, PCM is difficult to produce a good clustering result in real time. The paper proposes a distributed weighted possibillistic c-means clustering algorithm (DWPCM), which works in three steps. First the paper applies the partial distance strategy to PCM (PDPCM) for calculating the distance between any two objects in the incomplete data set. Further, a weighted PDPCM algorithm (WPCM) is designed to reduce the corruption of missing values by assigning low weight values to incomplete data objects. Finally, to improve the cluster speed of WPCM, the cloud computing technology is used to optimize the WPCM algorithm by designing the distributed weighted possibilistic c-means clustering algorithm (DWPCM) based on MapReduce. The experimental results demonstrate that the proposed algorithms can produce an appropriate partition efficiently for incomplete big sensor data.


Introduction
Recent years have witnessed the deployments of sensor networks for many critical applications such as Internet of Things (IoT) [1,2], environment monitoring [3], and target detection [4,5].With the rapid advent of various sensor networks, more and more mobile devices, such as smart phones and RFID, can sense and collect a huge number of sensor data.For example, data sampled from millions of sensors deployed sensor networks for environment monitoring can exceed hundreds of terabytes data every day.We are moving toward the era of big sensor data which requires efficiently advanced data analysis and mining tools [6].
As an important data mining tool, the possibilistic cmeans cluster algorithm (PCM) has emerged as an important technique for pattern recognition and data analysis, proposed by Krishnapuram and Keller [7].Many PCM variants have been proposed after standard PCM for improving the performance of the original PCM algorithm.Zhang and Leung applied the fuzzy method to PCM for an improved PCM algorithm, which enhances the robustness of the possibilistic approach [8].In 2011, Yang and Lai proposed a robust merging approach and created the automatic merging possibilistic clustering method to obtain the number of classes automatically.However, this algorithm could not always obtain the most reasonable number of clusters [9].Pal proposed FPCM and PFCM, which is a combined way of doing PCM with FCM, avoiding the coincidental cluster [10,11].In 2008, Xie et al. proposed an enhanced possibilistic c-means clustering algorithm, which partitioned the dataset into the main cluster and the assisted cluster to avoid producing the coincidental cluster [12].
PCM can be extensively used in big sensor data analysis and mining.However, in big sensor data, many data sets suffer from incompleteness; that is, a data set X can contain vectors that miss one or more of the attribute values [13].PCM could not succeed completely in clustering such incomplete data sets in real time.On the one hand, PCM could not calculate the distance between two objects in incomplete data sets.Meanwhile the accuracy of PCM is easily corrupted by incomplete objects.On the other hand, PCM is difficult to satisfy the real-time requirement of clustering International Journal of Distributed Sensor Networks incomplete big sensor data due to the huge amount of data.
The paper proposes a distributed weighted possibilistic c-means algorithm (DWPCM) for clustering incomplete big sensor data.First, the paper applies the partial distance strategy (PDS) [14] to PCM (PDPCM) for calculating the distance between two objects in incomplete data set using all available attribute values.Further, a weighted PDPCM algorithm (WPCM) is designed to reduce the corruption of missing values.In WPCM, a novel method for determining weight values is presented.Finally, the cloud computing technology is used to optimize the WPCM algorithm by designing the distributed weighted possibilistic c-means algorithm (DWPCM) based on MapReduce, which aims at improving the cluster speed of WPCM.
To evaluate the performance of the proposed algorithms, we implement the proposed methods on two real big sensor data sets.The experimental results demonstrate that the proposed algorithms can produce an appropriate partition efficiently for incomplete big sensor data.
The rest of this paper is structured as follows.Section 2 resumes some related research on possibilistic c-means clustering algorithms and the cluster algorithms based on cloud computing.Section 3 depicts the WPCM algorithm and Section 4 describes the DWPCM algorithm based on MapReduce.The performance evaluation is illustrated in Section 5.The paper is concluded in Section 6.

Possibilistic c-Means Algorithms.
PCM partitions an mdimensional dataset  = { 1 ,  2 , . . .,   } into several clusters to describe an underlying structure within the data.A possibilistic partition is defined as a  ×  matrix  = {  }, where   is the membership value of object   towards the th cluster,  is the number of clusters, and  is the number of objects.PCM obeys the constraint as follows: PCM minimizes the following objective function [7]: In (2),  = (V 1 , V 2 , . . ., V  ) is a C-tuple of prototypes and  = [  ] is a  *  matrix, called the possibilistic c-partition matrix, satisfying the conditions in (1).Here,  > 1 is a fuzzification constant and   is a suitable positive number.The first term demands that the distances from the object to the prototypes be as low as possible, whereas the second term forces the   to be as large as possible, aiming at avoiding the trivial solution.
Solving the minimization problem (2) yields membership functions of the form The cluster centers are updated using The procedure of PCM can be described as follows.
PCM is rational and it has a meaningful interpretation.What is more important is that PCM is robust, because a noisy object would belong to the clusters with small memberships, and consequently it cannot corrupt the resulting clusters significantly [8].
In the PCM algorithm, each object is considered equally important in the clustering solution.The weighted PCM (wPCM) model introduces weights to define the relative importance of each object in the clustering solution [15,16].Assume that w ∈   ,   ≥ 0, is a set of weight values that define the influence of each object, which results in the WPCM objective function of the form In order to find clusters with nonspherical shapes, kernelbased PCM algorithms are proposed, which mapped original data into higher dimensional feature space [17].Given a kernel function , the kernel PCM algorithm (kPCM) minimizes the following objective function: where Φ  is the kernel-based distance between the th object and the th object.
Even though these methods are performed well in the cluster process, they cannot succeed completely in clustering incomplete big sensor data because of the corruption of missing values.The paper proposed a weighted PCM algorithm (WPCM), which applies the partial distance to PCM for calculating the distance between two objects in incomplete data set and then assigns low weight values to incomplete data object for reducing the corruption of missing values.

Cluster Algorithms Based on Cloud
Computing.Cloud computing has emerged as a significant technology to deal with big data in time by leveraging vast amounts of computing resources available on demand with low resource usage cost [18,19].As the key technology of cloud computing, MapReduce [20] is a highly efficient distributed programming model for large-scale data sets in parallel computing.
Many cluster algorithms based on MapReduce have been proposed to improve the cluster efficiency.For example, Zhao [21] proposed parallel k-means algorithm based on MapReduce to cluster massive data fast.Similar algorithms include MapReduce-kCenter algorithm and MapReduce-kMedian algorithm, which used iterative sampling technology to get good performance for clustering very large data [22,23].Another example is hierarchical clustering algorithm based on cloud computing, in which MapReduce is used to optimize the hierarchical clustering algorithm for processing large-scale data [24,25].Well-known cluster algorithms based on MapReduce, such as HDBSCAN [26] and MR-DBSCAN [27], reduce I/O access frequency and spatial complexity for speeding up the density clustering.For affinity propagation clustering algorithm, published in science magazine proposed by Frey and Dueck [28], Lu et al. [29] proposed a distributed AP clustering algorithm based on MapReduce (DisAP), which includes three MapReduce stages and achieved high performance on both scalability and accuracy.In 2012, Yang et al. [30] presented a MapReducebased MST text clustering algorithm, which used cloud computing technology to improve the performance of the graph clustering.Yu and Dai [31] proposed a parallel fuzzy c-means algorithm based on MapReduce for improving the standard fuzzy c-means algorithm.Other parallel clustering algorithms based on cloud computing can be found in [32][33][34].
Even though these algorithms perform their job well, they all focus on crisp clustering.The paper uses cloud computing technology to accelerate the cluster speed of WPCM by designing a distributed WPCM algorithm (DWPCM) based on MapReduce to produce the possibilistic clustering for big data in real time.

PCM Based on Partial Distance for Clustering Incomplete
Data.In this subsection, the paper applies partial distance to PCM for clustering incomplete data set.Partial distance is used to calculate the distance between an object   and the th cluster center V  as follows: From (8a) and (8b), partial distance makes full use of attribute information of both complete data and incomplete data to calculate the distance between two objects.
The PCM algorithm based on partial distance (PDPCM) is obtained by making two modifications of PCM: (1) calculating   for incomplete data according to (8a) and (8b) and (2) updating the cluster centers with This algorithm enjoys all the standard convergence properties of PCM because it is an instance of alternating optimization [7].
For an -dimensional incomplete data set, the procedure of the PCM algorithm for clustering incomplete data based on partial distance can be described as follows.

Weighted PCM Algorithm.
Even though the PCM algorithm based on partial distance given earlier can cluster incomplete data sets, it is difficult to produce a good partition due to the effect of incomplete objects.The paper proposes a weighted PCM algorithm (WPCM), which assigns low weight values to incomplete objects for reducing the corruption of incomplete objects on the cluster process.
To minimize the objective function, WPCM updates the membership values and clusters centers using the following equation: The   weights can be viewed as strength terms describing the strength of membership of   in any cluster [16].Recently, a large number of algorithms have been proposed to determine the weight of the object.Perhaps the most representative method for detecting the outlier and reducing their effect on the cluster process in PCM was proposed by Schneider [16], which determines the weight of each object depending on the degree of belonging of each feature vector in any cluster.The equation for the weight values is where  > 0 is a suitably chosen constant.Equation ( 13) can detect the outlier effectively.However, it cannot reduce the effect of incomplete objects on the cluster process.The paper redefines the weights in where m is the number of the features of the data object,  represents the iterative times, and   is the number of missing feature values of the data object   .Equation ( 14) reduces the corruption of incomplete objects on the cluster process by the coefficient (1 −   /).A large   value of   , which indicates that   has many missing feature values, will result in a low weight value.In ( 14),  increases with the times of the cluster iteration increasing, which can accelerate the convergence of the clustering process.The steps of the WPCM algorithm are outlined as follows.

Time Complexity.
In this subsection, we discuss the time complexity of WPCM.All operations are counted as unit costs.We do not assume time economies that might be realized by special programming tricks or properties of the equations involved.We use the following notation in our discussion: is the number of iterations of WPCM over the full data set;  is the number of data objects; is the number of dimensions; is the number of clusters.
From the cluster process of the WPCM algorithm, the time complexity of this algorithm is dominated by Steps 4 and 5.The calculation of membership matrix using (14) requires () operations per cluster, which is also required by updating the cluster centers, resulting in a total time complexity of () for WPCM.

Distributed Weighted PCM Algorithm Based on MapReduce
In Section 3, we propose a weighted PCM algorithm based on partial distance, which can cluster incomplete data effectively.However, WPCM has a low efficiency for clustering incomplete big sensor data owning to the huge number of data and its high time complexity.To accelerate the cluster speed of WPCM, the paper proposes a distributed WPCM algorithm (DWPCM) based on MapReduce in this section.
From the steps of WPCM, there are two major operations: calculating the degree of membership   and calculating the clustering centers V  .
In the map phase, the Map function is designed to calculate the degree of membership   .
To reduce the communication cost of the distributed algorithm, the paper partitions the membership matrix into  blocks by columns, each block with / columns, where  is the number of the data objects and  is the number of the data nodes in a cloud platform.After partition, the paper puts each block of the membership matrix in a data node to calculate.
In order to update cluster centers in parallel, two parameters,  ()   and  ()  , are introduced, where  represents the serial number of data node.After calculating the membership   , the Map function calculates  ()   and  ()  using Finally, the Map function outputs  ⟨  , V  ⟩, where  is the number of classes,   represents the identifier of the class, and V  is a vector that consists of  ()    and  ()   .In the reduce phase, the Reduce function is designed to calculate the clustering centers V  .
The input of the Reduce function is a ⟨  , ⟩, where   is the identifier of the class and  includes all of value's with the same   obtained from the Map function.The Reduce function is responsible to compute the cluster centers according to where  is the number of the data nodes and  is the identifier of the class, which have the same explanation with   .Similar to Section 3.3, we discuss the time complexity of DWPCM now.The notation used in our discussion is the same as that in Section 3.3.The time complexity of DWPCM is approximately (/), where  is the number of data nodes in the cloud computing platform.Note that DWPCM sends the clusters and the weight values to all the data nodes in each iteration, which may cause communication overhead.However, the communication takes significantly less time than the calculation on the cluster process, especially in a centralized cloud computing platform where all cloud computing nodes are in the same location.So we ignore the communication overhead when estimating time complexity.

Experiments
In order to evaluate the efficiency and effectiveness of the proposed algorithms, we perform the algorithms on the two real data sets.The experimental setup and dataset are described first, followed by the results.

Experimental Setup and Dataset.
The experimental environment consists of 20 computers as cloud computing nodes, each of which has an Intel Core i7 processor with 3.2 GHz speed, 8 GB RAM, and a 2 TB hard drive.
We apply the algorithms on two real data sets in the experiments.The first data set is an extension of "Gas Sensor Array Drift Dataset at Different Concentrations Data Set" which is available in UCI Machine Learning Repository [35,36].This data set contains 3 × 10 9 objects with 129 numerical attributes, called eGSAD.The second data set consists of 2 × 10 10 objects sampled from the smart WSN lab, called sWSN, whose size is about 1 TB.The parameters used for the proposed algorithms are  = 10 −3 and  = 1.75, which can help to produce a good cluster result [37].We first artificially create missing values in the data sets for simulating incomplete data sets and then cluster them using the proposed algorithms.The performance of the proposed algorithms is evaluated by comparing their cluster results to the result of PCM for clustering original data sets from both the efficiency and effectiveness points of view.While the efficiency concerns the time required to cluster the data sets, the effectiveness is related to the cluster accuracy.
Since the cluster performance depends on the amount of missing values, we artificially create six kinds of missing ratios, which are 1%, 3%, 5%, 10%, 15%, and 20% objects with missing values.For every missing ratio, we generate 5 different incomplete data sets for Gisette and sWSN.Specifically, any two data sets can have different missing values for every missing ratio.

Experimental Results on Cluster Accuracy.
In order to assess the effectiveness of WPCM, two well-known evaluation criteria,  * and Adjusted Rand Index (ARI), are used in the experiment [37,38].
International Journal of Distributed Sensor Networks   The first evaluation criterion,  * , is used to assess the error between ideal cluster centers and cluster centers produced by a specific algorithm, which is calculated according to where V  ideal represents the th ideal cluster center and V  * denotes the th cluster center produced by a specific algorithm * .A lower value  * of indicates that the algorithm produces more accurate cluster centers.The other evaluation criterion, ARI(,   ), is used to measure the agreement between two possibilistic partitions of a set of objects, where  represents the ground truth labels for the objects in the data set and   denotes a partition produced by a specific algorithm.A higher value of ARI(,   ) indicates that the algorithm produces a better cluster result.
To eliminate the variation in the results from trial to trial, Tables 1 and 2 present the average values of  * and ARI obtained over 10 trials on incomplete eGASD and sWSN data sets, and the same incomplete data set is used in each trail for each of the three approaches so that the results can be correctly compared.
We present the cluster accuracy of PCM, PCM based on partial distance (PDPCM), WPCM, and DWPCM on the two data sets in terms of  * for 6 missing ratios in Tables 1 and 2.
From Tables 1 and 2, as the missing ratio increases, the average values of  * of three algorithms increase, which argues that the cluster accuracy is affected by missing ratios.In terms of  * , WPCM always performs better than PDPCM because the average  * value of WPCM is lower than that of PDPCM for 6 missing ratios, which demonstrates that the cluster prototypes obtained by WPCM are closer to the actual ones.DWPCM produces the same result as WPCM based on  * because the two algorithms use the same methods to calculate the weight values, the cluster prototypes, and the membership matrix.
To calculate the ARI, we first harden the possibilistic partitions by setting the maximum element in each column of  to 1 and all else to 0. Tables 3 and 4 show the average values of ARI(,   ) obtained by PDPCM, WPCM, and DWPCM.
From the results shown in Tables 3 and 4, WPCM produces better partitions than PDPCM for 6 different missing ratios of the two data sets in terms of ARI.DWPCM still produces the same result as WPCM based on ARI.

Experimental Results on Execution
Time.We use execution time and scalability to evaluate the efficiency of DWPCM.The average execution time of PDPCM, WPCM, and DWPCM on the two data sets for different number of objects is shown in Figures 1 and 2.
From Figures 1 and 2, even though the average execution time of the three algorithms increases with the number of objects increasing, DWPCM takes the least time of the three algorithms for the two data sets.Especially when the data set is very big, the execution time required by DKPCM is significantly less than the other two algorithms, which demonstrates that DWPCM performs most efficiently for clustering big sensor data.
In order to test the scalability of DWPCM, we perform the algorithm for clustering the two data sets in the different cloud computing platforms, in which there are 5 nodes, 10 nodes, 15 nodes, and 20 nodes, respectively.The result is shown in Figure 3.
From Figure 3, the execution time of DWPCM for clustering the two data sets reduces gradually with the increasing number of data nodes in the cloud computing platform, which demonstrates that adding nodes can significantly improve the system capacity.Therefore, DWPCM has a good scalability, especially when the data set is big.

Conclusion and Future Work
The paper proposes a distributed weighted PCM algorithm for clustering incomplete big sensor data.The proposed algorithm applies partial distance strategy to PCM (PDPCM) for calculating the distance between any two objects in the incomplete data set.Further, based on PDPCM, the paper designs a weighted PCM algorithm (WPCM) to reduce the corruption of incomplete objects on the cluster process.Another unique property of the proposed algorithm is the use of the cloud computing technology.Cloud computing is used to optimize WPCM to provide a significant computation speed, which is very important for big sensor data real-time clustering and analysis.
The experiments demonstrate that WPCM produces a good cluster result based on both evaluation criteria, namely,  * and ARI.As for efficiency, DWPCM performs better than WPCM, which takes significantly less time than WPCM for clustering incomplete big sensor data.Note that DWPCM produces the same cluster result as WPCM in terms of  * and ARI, which demonstrates that DWPCM improves the cluster efficiency without reducing the cluster accuracy.
In the future research work, we will investigate a further improvement of DWPCM to improve the effectiveness and efficiency of clustering big sensor data with many missing values.Additionally, for many semistructured and unstructured data in big sensor data, our future research plans will modify DWPCM to cluster the two types of incomplete data into appropriate groups.

Table 1 :
Averaged results of 10 trials on the eGASD set in terms of  * .

Table 2 :
Averaged results of 10 trials on the sWSN set in terms of  * .

Table 3 :
Averaged results of 10 trials on eGASD in terms of ARI.

Table 4 :
Averaged results of 10 trials on sWSN in terms of ARI.