Impact Parameter Analysis of Subspace Clustering

Subspace clustering, which detects all clusters in affine subspaces of a given high dimensional vector space, is used in various applications, including e-business. The performance and result of a subspace clustering algorithm depend heavily on the parameter values with which the algorithm is run. It may not be clear whether the resultant clusters are meaningful for a given dataset or merely an artifact of the chosen parameter values. Although choosing proper parameter values is crucial for both the clustering quality and the performance of the algorithm, there has been little research or discussion on this topic. In this paper, we propose a methodology for determining proper parameter values in subspace clustering. We also validate our approach through experimental analysis, using various real-world datasets. The study can serve as a reference model for any subspace clustering experiment in which parameters must be set to obtain high-quality clusters.


Introduction
Recently, a group of algorithms called "subspace clustering" [1][2][3][4] has been attracting academic interest for clustering high dimensional data. Clustering is a crucial task that is used in various applications, either to detect the dense regions of a given dataset or as a prerequisite step for further processes, such as classification.
Subspace clustering can be used in many smart business application areas, including, but not limited to, the following [5,6].
(i) Product recommendations: the collaborative filtering technique is well known and widely used in the domain of product recommendation [7]. If the information on which customers have purchased which products is represented in a vector data model, finding customers with a similar purchase history becomes a subspace clustering problem [8].
(ii) Smart sensor logs: as electronic devices and storage media become cheaper and small devices such as smart phones become popular, the log information collected by smart sensors is attracting more interest from industry. The log may represent the users' patterns and can be used in product searching or recommendation [9]. The number of sensors and the volume of data they collect can both be large. The log information can also be represented in a vector model.
(iii) Social network services: many social media sites such as Twitter provide users with a "follow" feature, which enables users to consume their own personalized content. The user subscription information can be modelled in a high dimensional vector model [9]. Clustering users means finding groups of users who have similar interests.
However, technological improvements in the sensor, transmission, and storage domains have led to a flood of high dimensional data, whose dimensionality is typically 10 or greater. Traditional clustering algorithms, which treat all dimensions equally, rarely produce satisfactory results on such data, because the differences in distance between pairs of data objects collapse. Unlike these traditional approaches, subspace clustering has been proposed as an alternative that detects all clusters residing in affine subspaces of a given high dimensional vector space.
International Journal of Distributed Sensor Networks
Adopting a density-based clustering paradigm [10], a subspace cluster is defined as a connected component of objects, where two data objects are considered "connected" if and only if the distance between their projections onto a given affine subspace is not greater than a given bandwidth. One advantage of the density-based clustering paradigm is that it can detect clusters with arbitrary shapes, and this carries over to subspace clustering. For this reason, considerable research has been published on this topic.
However, most of these algorithms share a critical problem: their parameters. To conduct clustering, a number of parameters must be set, including the following: bandwidth (ε), density threshold (τ), minimum cluster size (minSize), and duplication factor. All objects in each subspace cluster must be connected with no fewer than τ other objects on the associated subspace, and each cluster must have at least minSize objects. Among these clusters, only dissimilar clusters with respect to a given duplication factor are included in the final results [5].
For the last two parameters, 1% of the whole dataset size and 0.1 have been widely used in multiple works [2,5]. However, there has been little literature or work on selecting the first two parameters, which heavily affect the final clustering results. For example, if the value of ε is too large, the result may include too much noise. In contrast, if the value of ε is too small, we may get lossy results. The opposite situation can occur with regard to the value of τ. Moreover, the selection of parameters affects not only the quality of the clustering results but also the efficiency of the algorithm. The running time of the algorithm falls as the value of ε decreases and the value of τ increases, because the number of objects and connections that must be considered decreases accordingly. For these reasons, choosing adequate values for the (ε, τ) pair is crucial. If the values are inappropriate, applying a subspace clustering algorithm to a given input will result in poor output, excessive running time, or both.
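The roles of the bandwidth, the density threshold, and the minimum cluster size can be illustrated with a minimal sketch. This is a naive quadratic-time illustration, not the algorithm of [5]: it assumes an axis-parallel subspace given by dimension indices, uses Euclidean distance, counts an object's connections before any filtering, and ignores the duplication factor. The names `subspace_clusters`, `eps`, `tau`, and `min_size` are hypothetical.

```python
from collections import deque
from math import dist


def subspace_clusters(points, dims, eps, tau, min_size):
    """Naive density-based clustering in one axis-parallel subspace (sketch).

    points   : list of equal-length numeric tuples
    dims     : indices of the dimensions spanning the subspace (assumption)
    eps      : bandwidth; two objects are connected if the distance between
               their projections is not greater than eps
    tau      : density threshold; an object needs at least tau connections
    min_size : minimum number of objects in a reported cluster
    """
    proj = [tuple(p[d] for d in dims) for p in points]
    n = len(proj)
    # Connections of every object under the bandwidth (excluding itself).
    nbrs = [[j for j in range(n) if j != i and dist(proj[i], proj[j]) <= eps]
            for i in range(n)]
    # Objects that satisfy the density threshold.
    dense = {i for i in range(n) if len(nbrs[i]) >= tau}
    seen, clusters = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in nbrs[i]:
                if j in dense and j not in seen:
                    seen.add(j)
                    queue.append(j)
        if len(component) >= min_size:
            clusters.append(sorted(component))
    return clusters
```

In this sketch, raising `eps` adds connections and lowering `tau` admits more objects, so both tend to increase the amount of work and the number of objects that end up in clusters.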
However, selecting proper parameter values is not a simple task, because preliminary information is usually not available. One possible method is a trial-and-error approach, which repeatedly conducts clustering with different combinations of parameter values and then selects the most satisfactory result. This approach has its own limit, however: as clustering is inherently a computation-intensive task, its running time is typically long, so trying many combinations of parameters may not be practical.
In this paper, we propose a parameter-search method based on random sampling. We perform experiments to show the impact of parameters in subspace clustering and to find their proper values. Experimental analysis shows that our approach is reasonable for various real-world datasets.

Strategy
To overcome the problems stated above, we propose a simple yet efficient approach. Our search strategy exploits a random sampling approach. It includes the following steps.
(1) Determine the value of the (ε, τ) pair that is computationally feasible on the given machine, using the full input set. That is, select the largest value of ε and the smallest value of τ for which clustering can be completed within the desired timeframe on that machine. We use ε_max and τ_min to denote these values, respectively. These values may differ from one input to another.
(2) Select candidate value pairs. For arbitrary ε_min and τ_max, we can choose m × n candidate pairs (ε, τ) that satisfy ε_min ≤ ε ≤ ε_max and τ_min ≤ τ ≤ τ_max, where m and n are the numbers of candidate values for ε and τ.
(3) Generate a random sample from the full dataset, without replacement.
(4) Run a given subspace clustering algorithm on the sample set, for all m × n candidate parameter pairs.
(5) Compare the results, using a quality measure. Choose the parameter pair (ε_opt, τ_opt) that yields the best quality.
Because the running time falls as ε decreases and τ increases, the running time with (ε_opt, τ_opt) on the original full dataset cannot be longer than that with (ε_max, τ_min). Therefore, the running time is guaranteed to stay within the given timeframe. The advantage of this approach is twofold: it not only allows trials on various candidates, but also respects the time limit, which is very common in real-world applications.
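The search steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `cluster_fn` and `quality_fn` are hypothetical callables supplied by the caller (a clustering routine and a quality measure), and any rescaling of the density threshold to the sample size is left to `cluster_fn`.

```python
import random
from itertools import product


def search_parameters(dataset, eps_candidates, tau_candidates,
                      cluster_fn, quality_fn, sample_rate=0.1, seed=0):
    """Return the candidate (eps, tau) pair with the best quality on a sample.

    cluster_fn(sample, eps, tau) -> clustering result   (hypothetical)
    quality_fn(result, sample)   -> number, higher is better (hypothetical)
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * sample_rate))
    # Random sample of the full dataset, drawn without replacement.
    sample = rng.sample(list(dataset), k)
    best_pair, best_quality = None, float("-inf")
    # Evaluate every candidate pair on the sample only.
    for eps, tau in product(eps_candidates, tau_candidates):
        quality = quality_fn(cluster_fn(sample, eps, tau), sample)
        if quality > best_quality:
            best_pair, best_quality = (eps, tau), quality
    return best_pair, best_quality
```

The candidate lists are expected to contain only values no larger than the feasible bandwidth and no smaller than the feasible density threshold, so the selected pair can then be applied once to the full dataset.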

Experimental Setup
To validate our approach, we perform an experiment to check the efficiency of the strategy and whether it actually detects adequate parameter values. For the experiment, we use three real-world datasets with different dimensionality and characteristics from the UCI machine learning repository [11]: the Pendigit dataset with 16 dimensions, and the Cell and Biodeg datasets with 30 and 41 dimensions, respectively. From each of these datasets, we generate 3 input sets of different sizes: the input set with the full population is generated by repeatedly selecting 10,000 objects from the original dataset. To normalize the different dimensions, all element values are converted into the z-score of the associated dimension. Then, two smaller input sets with 1,000 and 5,000 objects are generated through random sampling without replacement.
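The normalization and sampling steps can be sketched as follows. This is a minimal illustration using only the standard library; the function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Convert every value to the z-score of its dimension (column)."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]  # assumption: population std. dev.
    # A constant dimension has zero deviation; map its values to 0.0.
    return [tuple((v - m) / s if s > 0 else 0.0
                  for v, m, s in zip(row, mus, sigmas))
            for row in rows]


def sample_without_replacement(rows, k, seed=0):
    """Draw k distinct rows uniformly at random."""
    return random.Random(seed).sample(rows, k)
```

For instance, `sample_without_replacement(zscore_columns(rows), 1000)` would produce a 1,000-object input set from a normalized full population held in `rows`.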
To sum up, we prepare 9 input sets from 3 different original datasets and with 3 different sizes (Table 1).
For algorithm implementation, we use a distributed version of the subspace clustering algorithm introduced in [   In addition, we operate a separate ZooKeeper cluster, which serves as distributed memory. This cluster consists of 3 commodity machines. Tables 2 and 3 summarize the hardware specification of each cluster. All nodes within each cluster run on virtual hardware provided by DigitalOcean (https://www.digitalocean.com/), with Ubuntu 13.10 x64 and Oracle Java Runtime Environment version 7, update 40.
Using the settings in Tables 2 and 3, we compare the clustering results yielded from each input set with different (ε, τ) values. To fix the values of ε_max and τ_min for each dataset, we try random values for (ε, τ) and select a pair with which the execution of the algorithm finishes in about 30 minutes. Table 4 shows our parameter settings. As each dataset has its classification label, the quality of clustering can be measured. For accuracy measurement [13][14][15], we use the F1 score, as it considers both precision and recall and is widely used in recent literature [4,14,15]. In Table 4, all candidate values for (ε, τ) satisfy ε_min ≤ ε ≤ ε_max and τ_min ≤ τ ≤ τ_max. As the size of each input set differs, represents
Tables 5-13 show the F1 scores of the clustering results from the 9 input sets with varying (ε, τ) parameter settings, averaged over 5 independent trials. For each of the datasets, the three settings with the highest F1 score are in bold. The results show that the parameter settings that yield the most satisfactory clustering results are almost the same, regardless of the size of the input set.
For example, Table 5 shows the F1 scores for the pendigit-1000 input set. The table shows that none of the trials with ε < 10 results in a cluster. Table 6 shows the case of the pendigit-5000 input set. As with pendigit-1000, the parameter candidates with ε < 10 detect no clusters. This phenomenon is common to the other input sets.
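As a reminder of how the F1 score combines precision and recall, the following minimal sketch computes it from raw counts. How clustered objects are matched against classification labels to obtain these counts is not specified here, so the counts are treated as given inputs; the function name is hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        # No predictions or no positives: define the score as 0.0.
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this convention, a parameter setting that detects no clusters at all receives a score of 0.0 rather than an undefined value.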

Results Analysis
In the case of the Pendigit dataset, the Pearson correlation coefficient of the F1 values between the full population and the 10% sample was 0.5945, and that between the full population and the 50% sample was 0.7966, which suggests a strong positive linear relationship between them. In the case of the Cell and Biodeg datasets, the values of the Pearson correlation coefficient were 0.4915 and 0.6400, and 0.6985 and 0.9960, respectively. Figures 1, 2, and 3 show the trend of the F1 value between the two sample sets and the full population.
[Figure: F1 values from Tables 8-10, with the x axis for parameter candidates (sorted by ε first) and the y axis for the F1 value.]
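The Pearson correlation coefficient used in this comparison can be computed as in the following minimal sketch. The function name is hypothetical, and the two input sequences stand for the F1 values of the same parameter candidates measured on a sample and on the full population.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)
```

A value close to 1 means that parameter candidates ranking well on the sample also tend to rank well on the full population.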
The results suggest that the most adequate parameter setting is not affected by the size of the input set. That is, it is a reasonable strategy to estimate optimal parameter values with a small sample of the full dataset, achieving both time efficiency and accuracy while staying within the time limit. However, keeping the sample rate too low is not a good choice: as shown in Figure 3, the correlation coefficient between the 10% sample and the full population was only 70.1%∼76.7% of the coefficient between the 50% sample and the full population. This trend suggests that using too small a sample set may distort the estimate. Moreover, using too small an ε_min is also not recommended. Although our data consists of z-score values that are distributed on the range [0.0, 100.0), more than half of the candidates with ε ≤ 10.0 yield no results.
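The ratio between the two coefficients can be rechecked directly from the values reported for each dataset. The sketch below only repeats that arithmetic; the variable names are hypothetical.

```python
# Pearson coefficients against the full population, as reported above:
# (10% sample, 50% sample) for each dataset.
coefficients = {
    "Pendigit": (0.5945, 0.7966),
    "Cell": (0.4915, 0.6400),
    "Biodeg": (0.6985, 0.9960),
}

# Ratio of the 10%-sample coefficient to the 50%-sample coefficient.
ratios = {name: r10 / r50 for name, (r10, r50) in coefficients.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

Computed from the rounded coefficients, the ratios fall between roughly 0.70 and 0.77 for all three datasets.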

Conclusions
Based on the experimental evaluation and result analysis, we propose the following methodology for estimating the optimal parameter values for subspace clustering. First, determine the value of (ε_max, τ_min) with which algorithm execution on the full input set finishes in the desired timeframe. Second, prepare candidate values for (ε, τ) that satisfy ε ≤ ε_max and τ_min ≤ τ. Then, select from these candidates the pair (ε_opt, τ_opt) that yields the best quality when applied to a reduced input set drawn from the original one, for example, a 10% random sample.
The advantage of this strategy is twofold: it not only allows searching and comparison over various combinations of candidates, but also makes the execution time predictable. Experimental results with real-world datasets suggest that parameter values obtained from this approach can show the best accuracy on the full population of the input set.