Robust Distance Measures for kNN Classification of Cancer Data

The k-Nearest Neighbor (kNN) classifier represents a simple and very general approach to classification. Still, the performance of kNN classifiers can often compete with more complex machine-learning algorithms. The core of kNN is a “guilt by association” principle, where classification is performed by measuring the similarity between a query and a set of training patterns, often computed as distances. The performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study on classification of cancer data sets, we have used both common and novel distance measures, including the new Sobolev and Fisher measures, and we have evaluated the performance of kNN with these distances on 4 cancer data sets of different types. We find that the performance of the novel distance measures is comparable to the performance of the more well-established measures, in particular for the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures show robust performance in kNN across several data sets, in particular the Hassanat, Sobolev, and Manhattan measures. Some of the other measures show good performance on selected data sets but seem to be more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.


One benchmarking study investigated kNN classification on 3 categories of data, consisting of categorical, numerical, and mixed data types. The data sets were from the UCI repository of data sets for machine learning, and 4 different distance measures were compared: Euclidean, cosine, chi-square, and Minkowski. Cross-validation (70% training and 30% testing) was used to measure performance, with k-values between 1 and 15. The experiments showed the chi-square distance measure to be the best for all 3 data types, whereas the cosine, Euclidean, and Minkowski distances led to the lowest accuracy on the mixed-type data set.
Punam and Nitin 15 used the KDD data set 16 and the kNN classifier with the Chebyshev, Euclidean, and Manhattan distance measures. The KDD data set contains numeric data for 41 features in 2 classes. They estimated accuracy, sensitivity, and specificity to evaluate the performance of kNN for each distance. The Manhattan distance outperformed the other distances, with 97.8% accuracy, 96.76% sensitivity, and 98.35% specificity.
Todeschini et al 17,18 investigated the kNN classifier on 8 benchmark data sets with 18 different distance measures, including Manhattan, Euclidean, Soergel, Lance-Williams, contracted Jaccard-Tanimoto, Bhattacharyya, Lagrange, Mahalanobis, Canberra, Wave-Edge, Clark, cosine, correlation, and 4 locally centered Mahalanobis distances. The non-error rate and average rank for each distance were determined to evaluate the efficiency of each measure. The results indicated that the highest accuracy was achieved for the Manhattan, Euclidean, Soergel, contracted Jaccard-Tanimoto, and Lance-Williams distance measures.
In a comprehensive review study, Prasath and colleagues 19 investigated the impact of 54 different distance measures on 28 data sets obtained from the UCI machine-learning repository. On most data sets, the Hassanat distance gave the best performance compared with the other distances.
In summary, these benchmarking studies (and others) have shown that no distance metric is optimal for all data types. Each data type may require a different distance metric for optimal performance in kNN, which is consistent with the principle of "no free lunch." This makes it relevant to ask how we can guide users with respect to the choice of distance metrics for kNN classification of complex data sets to achieve optimal performance. Here, we have tried to answer that question by identifying metrics with relatively consistent performance across a range of complex data sets, using a selection of both common and more novel metrics.
Specifically, we have investigated the performance of kNN classification with 12 different distance metrics: 8 common and well-known metrics (Euclidean, Manhattan, Canberra, Chebyshev, Bray-Curtis, Clark, Hamming, and Bhattacharyya), 2 more novel metrics (Hassanat and Soergel), and 2 new metrics presented by us (Sobolev and Fisher). We have tested these metrics on 4 different cancer data sets: breast cancer (cytology), brain cancer (imaging), lung cancer (multivariate), and prostate cancer (clinical). We have evaluated the overall performance of each metric by ranking the metrics according to classification performance across these data sets.

[Figure 1. Diamond- and square-shaped neighborhoods are generated by the Manhattan and Chebyshev distances, respectively. In this case, a new query pattern (blue star) would be classified as either green or red by the Chebyshev and Manhattan distances, respectively.]

Methods

Data sets
The experiments were done on 4 cancer data sets, for brain, lung, breast, and prostate cancer (see Table 1).
For brain cancer, we used a data set consisting of 2-dimensional (2D) slices of CE-MRI images for 3 types of tumors: glioma, meningioma, and pituitary tumor. Data for 233 patients with a total of 3,064 images (axial, coronal, and sagittal views) were available. The original size of each image was 512 × 512 pixels, which was reduced to 64 × 64 pixels to speed up the computations. The breast and lung cancer data sets were benchmark data sets obtained from the UCI Machine-Learning Repository. The Wisconsin Breast Cancer Data set (WBCD) has 699 instances with 9 attributes of cytology data for 2 types of tumors (i.e. malignant and benign). The lung cancer data set is a multivariate data set with 55 attributes for 32 instances. The prostate cancer data set is a data frame with 97 rows and 9 features from a study examining the correlation between the level of prostate-specific antigen and several clinical parameters in participants about to receive a radical prostatectomy.
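The exact downscaling procedure is not specified above; as an illustration only, a simple block-mean pooling in NumPy (assuming grayscale images with side lengths divisible by the pooling factor) could look like this:

```python
import numpy as np

def downscale(img: np.ndarray, factor: int = 8) -> np.ndarray:
    """Reduce a 2D image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    assert h % factor == 0 and w % factor == 0, "image size must be divisible by factor"
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Example: a 512 x 512 image becomes 64 x 64.
img = np.random.rand(512, 512)
print(downscale(img, factor=8).shape)  # (64, 64)
```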

Distance measures
Here, we give mathematical formulas for distance measures estimating the closeness between 2 vectors x and y with numerical attributes. We write $d_m(x, y)$ for the distance between x and y as measured by m. Formulations and terminology are mainly taken from Abu Alfeilat et al, 19 with additional definitions as specified.
Minkowski, Euclidean, Manhattan, and Chebyshev distance. This family of distances is defined as

$$d_{Mink}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p},$$

where p is a positive value. It is the Manhattan distance when p = 1 and the Euclidean distance when p = 2, whereas the Chebyshev distance is the variant of the Minkowski distance where p = ∞. The Chebyshev distance is also known as the maximum value distance, 23 Lagrange, 17 and chessboard distance, 24 and can be formulated as

$$d_{Cheb}(x, y) = \max_{i} |x_i - y_i|.$$

Canberra distance. This weighted version of the Manhattan distance was introduced and later modified by Lance and Williams: 25

$$d_{Can}(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}.$$

Hamming distance. This distance is based on the number of positions at which 2 vectors differ: 27

$$d_{Ham}(x, y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i].$$

It is mainly used to analyze nominal data but can also be used for numerical data.
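As a sketch, these distances can be implemented directly in NumPy (plain reference implementations, not the optimized versions available in scikit-learn or SciPy):

```python
import numpy as np

def minkowski(x, y, p=2.0):
    # p = 1 gives Manhattan, p = 2 gives Euclidean.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    # Limiting case p -> infinity: the maximum coordinate difference.
    return np.max(np.abs(x - y))

def canberra(x, y):
    # Weighted Manhattan distance; 0/0 terms contribute 0 by convention.
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    return np.sum(num / np.where(den == 0, 1.0, den))

def hamming(x, y):
    # Number of positions where the two vectors differ.
    return np.sum(x != y)
```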
Sorensen distance. This distance is often used to describe relationships in areas like ecology and environmental sciences, 29 and it is also known as the Bray-Curtis distance. It is a modified Manhattan distance, where the total sum of the values is used to standardize the difference over the vectors x and y: 30

$$d_{Sor}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}.$$

It lies between 0 and 1 when all values of the vectors are positive.

Clark distance. This distance 31 is also known as the coefficient of divergence and is the square root of half the divergence distance:

$$d_{Clark}(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{|x_i| + |y_i|} \right)^2 }.$$
Soergel distance. This distance (also known as the Ruzicka distance) is widely used for calculating evolutionary distances. 32 It is identical to the complement of the Jaccard or Tanimoto similarity coefficient for binary variables: 32

$$d_{Soe}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \max(x_i, y_i)}.$$

Sobolev distance. Definitions and notations for this distance are as given by Villmann. 35 Starting with the standard p-inner product (shown here for p = 2)

$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i,$$

the Sobolev inner product, norm, and metric of degree k can be defined as

$$\langle x, y \rangle_{S_k} = \sum_{j=0}^{k} \langle D^j x, D^j y \rangle, \qquad \| x \|_{S_k} = \langle x, x \rangle_{S_k}^{1/2}, \qquad d_{Sob}(x, y) = \| x - y \|_{S_k},$$

where $D^k$ is the kth differential operator. There is a connection to the Fourier transform for the special case p = 2 and α = 1. Let $\hat{x}$ be the Fourier transform of x,

$$\hat{x}_k = \sum_{j=0}^{N-1} x_j e^{-i \omega_k j},$$

where $\omega_k = 2\pi k / N$ and $i = \sqrt{-1}$. The norm can then be defined as

$$\| x \|_{S_k} = \left( \sum_{m=0}^{N-1} (1 + \omega_m^2)^k\, |\hat{x}_m|^2 \right)^{1/2}.$$

Here, we have used this metric with the Fourier form of the norm and k = 1.
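Reference implementations of these measures follow directly from the formulas above. Note that the Sobolev version below is only a sketch: it assumes a forward-difference approximation (np.diff) of the differential operator with k = 1, rather than the Fourier form of the norm:

```python
import numpy as np

def sorensen(x, y):
    # Bray-Curtis: Manhattan distance standardized by the total sum.
    return np.sum(np.abs(x - y)) / np.sum(x + y)

def clark(x, y):
    den = np.abs(x) + np.abs(y)
    den = np.where(den == 0, 1.0, den)  # 0/0 terms contribute 0
    return np.sqrt(np.sum(((x - y) / den) ** 2))

def soergel(x, y):
    return np.sum(np.abs(x - y)) / np.sum(np.maximum(x, y))

def sobolev_k1(x, y):
    # Degree-1 Sobolev metric, assuming forward differences for D.
    d = x - y
    return np.sqrt(np.sum(d ** 2) + np.sum(np.diff(d) ** 2))
```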
Fisher distance. Definitions and notations are as given by Lebanon. 36 We first define the n-simplex

$$P_n = \left\{ x \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} x_i = 1,\ x_i \geq 0 \right\}.$$

The sequence $\{x_i\}$ gives the probabilities of the different outputs in each experiment. The Fisher information metric on $P_n$ can be defined for tangent vectors u, v at $x \in P_n$ by

$$\mathcal{J}_x(u, v) = \sum_{i=1}^{n+1} \frac{u_i v_i}{x_i}.$$

The Fisher information can be seen as a pull-back metric from the positive n-sphere $S_n^+$: the transformation

$$T : P_n \to S_n^+, \qquad T(x) = \left( \sqrt{x_1}, \ldots, \sqrt{x_{n+1}} \right)$$

pulls back the Euclidean metric on the surface of the sphere to the Fisher information on $P_n$. The Fisher metric for $x, y \in P_n$ can now be defined as the length of the great circle (geodesic) between $T(x)$ and $T(y)$ on $S_n^+$:

$$d_{Fish}(x, y) = \arccos\left( \sum_{i=1}^{n+1} \sqrt{x_i y_i} \right).$$

Performance measures
We evaluated the classifications by precision, recall, F1, and accuracy, which can be defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
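A minimal sketch of the Fisher distance defined above, assuming non-negative input vectors that are normalized onto the simplex before mapping to the sphere (in practice, the performance measures above can be computed with scikit-learn's precision_score, recall_score, f1_score, and accuracy_score):

```python
import numpy as np

def fisher(x, y):
    # Map onto the positive n-sphere via T(x) = sqrt(x), then measure the
    # great-circle (geodesic) distance between T(x) and T(y).
    x = x / np.sum(x)  # project onto the simplex
    y = y / np.sum(y)
    inner = np.clip(np.sum(np.sqrt(x * y)), -1.0, 1.0)  # guard against rounding
    return np.arccos(inner)
```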

Ranking of distance measures
For each distance and performance score, we considered the best (maximum) score across all values $k \in K$ as the final score. If $S_{dpe}^{k}$ is the score of distance d for performance measure p and experiment e at a given k, the final score can be defined as

$$S_{dpe} = \max_{k \in K} S_{dpe}^{k}.$$

We then ranked the distances according to the final score for each individual experiment, using 2 different approaches. The first approach was simply to compute the average of the ranks across all experiments. That is, for a given experiment e and a given performance measure p, the score $S_{dpe}$ was computed for each distance metric d, and the distance metrics were ranked according to this score. This was repeated for each combination of e and p, giving one ranking per combination. The final ranking was then estimated as the average rank of each distance metric over all these rankings. For the second approach, we used the RankAggreg tool, 37 an R package for weighted rank aggregation, on the complete set of ranked lists described above, using the Cross-Entropy Monte Carlo (CE) method, Kendall distances, and rho = 0.1 (see the RankAggreg documentation).
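The first (average-rank) approach is straightforward to express in code. The sketch below uses a hypothetical score array with 12 distances, 4 performance measures, 4 experiments, and 20 values of k; ties are broken arbitrarily by argsort (scipy.stats.rankdata could be used for proper tie handling):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores[d, p, e, k]: distance d, performance measure p,
# experiment e, neighborhood size k.
scores = rng.random((12, 4, 4, 20))

final = scores.max(axis=-1)  # best score over all k; shape (12, 4, 4)

# Rank the distances within each (p, e) combination: rank 1 = best score.
ranks = (-final).argsort(axis=0).argsort(axis=0) + 1

avg_rank = ranks.reshape(12, -1).mean(axis=1)  # average over the 16 rankings
print(avg_rank)
```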
In addition to the ranking, we used the k-means algorithm to cluster the distance measures based on their scores over all experiments, and plotted the result using the factoextra 38 tool in R. This gives a visual overview of the similarities and differences between the tested distance measures.
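A corresponding sketch in Python (the clustering above was plotted with factoextra in R; here scikit-learn's KMeans is used instead, on a hypothetical 12 × 16 score matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((12, 16))  # 12 distance measures x 16 (data set, measure) scores

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster label for each distance measure
```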

Software implementation
All scripts were written in the Python programming language (version 3.7.1) and run under Anaconda3. We used the scikit-learn package (version 0.20.1) to apply the kNN algorithm with the Euclidean, Manhattan, Chebyshev, Hamming, Canberra, and Bray-Curtis distances.
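A minimal usage sketch on synthetic data is shown below; the metric names are those accepted directly by scikit-learn, and a user-defined function (such as the Sobolev sketch above) could be passed as the metric callable instead, at some cost in speed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real data set (e.g. 9 attributes as in the WBCD).
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

for metric in ["euclidean", "manhattan", "chebyshev",
               "hamming", "canberra", "braycurtis"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>10s}: mean accuracy {acc:.3f}")
```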

Results
We applied all 12 distance measures to the 4 cancer data sets. For the brain, breast, and prostate cancer data sets, we used values of k from 1 up to 20. For the lung cancer data, the range of k was limited to values from 1 up to 11, due to the small size of this data set.
The best scores for the brain cancer data are shown in Table 2. The best precision score is achieved by Canberra, followed by Sobolev and Hassanat. For recall, the maximum is shared between Manhattan and Hamming, with Sobolev and Hassanat in second and third place. The best performances based on F1 and accuracy were achieved by Canberra and Hassanat, respectively.
The scores for the breast cancer data are shown in Table 3. The Clark distance achieved the best score for 3 performance measures: recall, F1, and accuracy. The best precision was achieved by the Bray-Curtis distance.
For the lung cancer data, the Sobolev distance outperformed the other distances, with the best performance according to precision, F1, and accuracy. The second rank was for the Fisher distance, which achieved the best recall score and shared the best F1 score with Sobolev. Finally, for the prostate cancer data, the Canberra distance clearly outperformed the other distances according to all performance measures.
To obtain an overall and robust ranking, we used the 2 approaches described under Methods: a basic average of the ranks for each distance measure estimated over 16 different rankings (i.e. all possible combinations of data set and performance measure), and a weighted aggregation of these rankings by the RankAggreg tool.
To compare the 2 approaches, we plotted the resulting rankings against each other, as shown in Figure 2. The plot shows a good correlation between the rankings, indicating that the overall ranking of the distance measures is robust.
The result of the k-means clustering of the performance scores over all experiments for k = 3 is shown in Figure 3. The set (Hassanat, Canberra, Sobolev, Manhattan, Euclidean, Soergel, Bray-Curtis) forms a relatively tight cluster, whereas the 2 additional clusters are (Hamming, Chebyshev, Clark) and (Bhattacharyya, Fisher). This is quite consistent with the ranking in Figure 2, where the main cluster is seen to consist of the measures with the best overall performance. A clustering with k = 4 splits the main cluster into 2 subclusters consisting of (Sobolev, Manhattan, Euclidean) and (Hassanat, Canberra, Soergel, Bray-Curtis), but the general clustering is stable. In summary, the k-means clustering confirms the ranking of the performance data shown in Figure 2.

Discussion
The results presented here show clear differences between distance measures with respect to classification performance on the cancer data sets. Some distance measures perform quite robustly across most data sets, whereas other measures show a clearly lower performance on some data sets. This seems to be largely independent of which performance measure is used (precision, recall, F1, or accuracy), which is confirmed by the loading plot of a principal component analysis (PCA) of the performance data from Tables 2 to 5 (Supplemental Figure S2 in Additional file 1). The plot shows very similar loadings for all performance measures for each data set, in particular for the data on breast cancer and lung cancer.
The individual classification results in Tables 2 to 5 show important differences (and similarities) between the distance measures, depending on data type. If we focus on the F1 performance measure, we see that both Fisher and Bhattacharyya have relatively low performance on brain cancer (Table 2), breast cancer (Table 3), and prostate cancer (Table 5), as does Hamming for prostate cancer. This is different for lung cancer (Table 4), where it is Clark and Chebyshev that are associated with low performance. These differences are confirmed by the k-means clustering (Figure 3), where both (Fisher, Bhattacharyya) and (Clark, Chebyshev, Hamming) form separate clusters, and by the PCA, where the loadings for the breast cancer data are clearly separated from the other cancer types (Supplemental Figure S2 in Additional file 1). It is also consistent with the ranking data shown in Figure 2, where these same distance measures are ranked together as having low performance.
The ranking of the well-performing measures shows some variation, but this is mainly due to the generally good performance of these measures, with only small (and partly random) differences between cases. However, it is important to realize that the performance of a given distance measure depends on the input data. For example, on the lung cancer data (Table 4), the Fisher measure shows one of the best performances, whereas it shows low performance on the other data sets. Similarly, the Clark measure is the best-performing measure on the breast cancer data (Table 3) but has very low performance on the lung cancer data. Apart from intrinsic effects of the type and distribution of the data, these differences could arise from properties of the distance functions themselves, which would be relevant for further studies.

The analysis presented here may be influenced by the quality of the input data, for example, whether cases in the training set are correctly annotated with respect to class (e.g. cancer versus normal). In principle, we can estimate the quality of training data by looking for consistent misclassifications, that is, cases that are consistently classified to a different class than their annotation. Such cases may represent potential annotation errors in the data set and may be considered for removal. However, we should probably expect some such cases in most data sets consisting of experimental data, in particular for complex properties like cancer, where it may be difficult to decide unambiguously whether a given sample should represent "cancer" or "normal." In the data presented here, the somewhat lower classification performance on the brain cancer and lung cancer data can possibly be linked partly to misannotated cases. However, such cases will be a natural part of most experimental data, and removing them may introduce user bias into the analysis. Also, kNN is expected to be somewhat robust with respect to errors in training data, in particular for higher values of k, as the classification will represent an average over multiple cases. Therefore, we have not considered removing such cases from the analysis.

The analysis will also be influenced by the choice of features, for example, if we select only specific features for analysis rather than using the full range of features of a data set. This may be relevant if the features represent very different properties. Again, selecting subsets of features may introduce user bias into the analysis. Here, we wanted to test the robustness of the various distance metrics, and we therefore decided to use all features as given in the original data sets, without any feature selection.

Conclusions
The performance analysis of kNN classification of cancer data with different distance measures identifies important differences between both distance measures and data sets. It is possible to identify a subset of distance measures that show robust performance across several data sets, including the Hassanat, Sobolev, and Manhattan measures. However, the study also confirms that no single distance measure will be optimal for all data sets, and the recommendation must be to test several measures on suitable reference data that are as similar to the actual data as possible when selecting a distance measure for a particular study.

Supplemental Material
Supplementary material for this article is available online.