Feature selection for binary classification based on class labeling, SOM, and hierarchical clustering

Feature selection plays an important role in algorithms for processing high-dimensional data, and traditional pattern classification and information theory methods are widely applied to it. However, traditional pattern classification methods such as Fisher Score, Laplacian Score, and Relief use class labels inadequately, while earlier information theory based feature selection methods such as MIFS ignore the intra-class tightness and inter-class sparseness property of the samples. To address these problems, a feature selection algorithm for the binary classification problem is proposed, based on class label transformation, the self-organizing map (SOM) neural network, and cohesive hierarchical clustering. The algorithm first converts class labels, which have no numerical meaning, into numerical values that can participate in operations while retaining the classification information, and combines them with the attribute values under evaluation to form two-dimensional vectors. These two-dimensional vectors are then clustered using the SOM neural network and hierarchical clustering. Finally, an evaluation function value closely related to intra-cluster tightness, inter-cluster separation, and division accuracy after clustering is calculated and used to evaluate the ability of candidate attributes to distinguish between classes. It is experimentally verified that the algorithm is robust, can effectively screen attributes with strong classification ability, and improves the prediction performance of the classifier.


Introduction
With the development and application of information technology, more and more high-dimensional data such as digital images, financial time series, and gene expression microarrays have been accumulated. Feature selection has become an indispensable preprocessing step in algorithms for processing high-dimensional data. Feature selection refers to the process of selecting, from all attributes, a subset of features that is most beneficial for subsequent operations, reducing the dimensionality of the feature space while keeping the decision-making capability of the information system unchanged. As an important part of knowledge discovery techniques, feature selection can improve the speed of knowledge learning, enhance the compactness of learning models, increase the generalization ability of models,[1] and enable one to use data with minimal complexity, improving the level of awareness of the information implicitly contained in huge datasets.
The filter approach[3-6] uses the intrinsic characteristics of the training data to judge the merits of the attributes to be selected and is independent of the learning algorithm.[3] According to the evaluation measures adopted, filter algorithms can be roughly classified into information, distance, dependency, consistency, similarity, and statistics measures.[4,7] Existing filter methods can be broadly categorized as either univariate or multivariate.[4,5,8] Univariate methods (i.e., feature weighting/ranking) test each attribute and score it according to its relevance to the class labels.[4,5] FS,[9] LS,[10] SPEC,[11] and Relief[12] are typical univariate assessments using different evaluation measures.[4] There is a consensus in univariate assessment about attributes that facilitate classification: attribute values of samples of the same class should be similar (close); attribute values of samples of different classes should differ greatly (be far apart); and the class labels of samples contain important information that helps in attribute screening.[13,14] The FS, LS, SPEC, and Relief algorithms score single attributes based on the criterion of intra-class tightness and inter-class sparseness, during which Relief uses sample class labels implicitly in a nearest-sample manner,[12] while FS, LS, and SPEC use them in a graph-based manner.[11,15] (In fact, some recent algorithms[16-19] have also used graphs to organize the attributes.) Multivariate methods focus directly on combinations of attributes that can represent all attributes, and a group of algorithms based on information theory,[20-25] represented by MIFS,[20] is the main representative of the multivariate family.[5] The MIFS algorithm exploits the fact that mutual information can describe the nonlinear correlation and spatial-transformation invariance among attributes,[23] and uses the mutual information between attributes and class labels and among attributes themselves as the basis for deciding whether an attribute enters the feature subset; however, it ignores the intra-class tightness and inter-class sparseness property of the samples to some extent.
To address the above problems, this paper proposes a feature selection algorithm based on class labels, the self-organizing map (SOM) neural network, and hierarchical clustering (FS-CSH) for binary classification, which takes intra-class tightness and inter-class sparseness into account and uses class labels explicitly. The main ideas of this algorithm are: (1) mapping class labels, converting class labels without numerical meaning into values that can participate in operations while retaining the classification information, and combining them with the attribute values under evaluation to form two-dimensional vectors; (2) clustering these two-dimensional vectors using the SOM neural network and hierarchical clustering; (3) calculating the evaluation function values, closely related to intra-cluster tightness, inter-cluster separation, and division accuracy after clustering, as attribute scores. The algorithm measures the relationship between all attributes and the classes, and evaluates the ability of attributes to distinguish between classes. It is experimentally verified that the algorithm is robust, can effectively rank the attributes according to their classification ability, and improves the prediction performance of the classifier.

Problem formulation
Without loss of generality, let $X = [X_1, X_2, \cdots, X_K]^T$ be the sample vector and $K$ the number of samples; let $A = [A_1, A_2, \cdots, A_M]$ be the attribute vector and $M$ the number of attributes. The $k$th sample is denoted $X_k = [a_{k1}, a_{k2}, \cdots, a_{kM}]$, where $a_{km}$ is the $m$th attribute value of the $k$th sample, and $y_k \in \{0, 1\}$ is the category label of the sample $X_k$.
For binary classification, the general naive view of whether an attribute $A_m$ is good for distinguishing categories is whether there is an obvious single mapping relationship between the distribution of $A_m$ values and the class labels class1 and class2, although this relationship is implicit. When the distribution of $A_m$ values naturally falls into two regions with an obvious implicit single mapping to the class labels class1 and class2, then $A_m$ is considered to have the ability to distinguish between categories, as shown in Figure 1(a), where the horizontal axis is the $A_m$ value and the vertical axis is the probability density function of the distribution of $A_m$ values. This plain understanding is expressed in the data space as intra-class tightness and inter-class sparseness of $A_m$ values, typically represented by LS, FS, and Relief. The LS algorithm scores the $m$th attribute $A_m$ as equation (1), the standard Laplacian Score form, where $D$ is the degree matrix, $L$ is the Laplacian matrix, and $\tilde{a}_m$ is the centered vector of $A_m$ values:

$$LS(A_m) = \frac{\tilde{a}_m^T L \tilde{a}_m}{\tilde{a}_m^T D \tilde{a}_m} \qquad (1)$$
FS selects attributes with close intra-class values and scattered inter-class values. It scores the $m$th attribute $A_m$ as equation (2), where $\mu_m$ is the mean value of $A_m$, $n_j$ is the number of samples belonging to the $j$th class, and $\mu_{m,j}$ and $\sigma^2_{m,j}$ are the mean and variance of $A_m$ on the $j$th class:

$$FS(A_m) = \frac{\sum_{j=1}^{2} n_j (\mu_{m,j} - \mu_m)^2}{\sum_{j=1}^{2} n_j \sigma^2_{m,j}} \qquad (2)$$
Relief scores the $m$th attribute $A_m$ as equation (3), where $a_{km}$ is the value of sample $X_k$ on attribute $A_m$, and $a_{NH(X_k),m}$ and $a_{NM(X_k),m}$ are the values on $A_m$ of the nearest samples of the same class and of a different class from $X_k$, respectively:

$$RF(A_m) = \sum_{k=1}^{K}\left[\left(a_{km} - a_{NM(X_k),m}\right)^2 - \left(a_{km} - a_{NH(X_k),m}\right)^2\right] \qquad (3)$$

There is no explicit trace of sample class labels in the above expressions. In fact, the implicit mapping between the distribution of $A_m$ values and class labels in Figure 1(a) can be viewed as a transformation of the sample-based conditional probabilities $p(a_{km} \mid y_k)$, realized through the implicit use of class labels: in the form of graphs in FS and LS, and in the form of nearest samples in Relief. This use of class labels is insufficient.
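As an illustration of these univariate criteria, the following sketch scores a single attribute with the Fisher Score of equation (2) and a Relief-style difference in the spirit of equation (3). The function and variable names are ours, and the Relief variant shown uses an exhaustive nearest-neighbor search over one attribute rather than the sampled procedure of the original algorithm.

```python
import numpy as np

def fisher_score(a: np.ndarray, y: np.ndarray) -> float:
    """Fisher Score of one attribute (equation (2)); larger is better."""
    mu = a.mean()
    num = sum(np.sum(y == c) * (a[y == c].mean() - mu) ** 2 for c in (0, 1))
    den = sum(np.sum(y == c) * a[y == c].var() for c in (0, 1))
    return num / den

def relief_score(a: np.ndarray, y: np.ndarray) -> float:
    """Relief-style score (equation (3)) with exhaustive 1-NN search;
    larger is better."""
    score = 0.0
    for k in range(len(a)):
        d = np.abs(a - a[k])
        d[k] = np.inf                    # exclude the sample itself
        hit = np.min(d[y == y[k]])       # nearest same-class distance
        miss = np.min(d[y != y[k]])      # nearest different-class distance
        score += miss ** 2 - hit ** 2
    return score

# Tiny usage example on synthetic data: an informative attribute scores high.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
a = y + 0.3 * rng.standard_normal(200)
print(fisher_score(a, y), relief_score(a, y))
```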
In addition, when the implicit single mapping between the distribution of $A_m$ values and the class labels is not obvious, the classification ability of $A_m$ becomes weaker, as shown in Figure 1(b); when this mapping disappears, $A_m$ loses its classification ability, as shown in Figure 1(c). However, if class labels are considered in an explicit way in a third dimension when examining this single mapping, then the originally unclassifiable $A_m$ becomes capable of distinguishing classes with the addition of that third dimension, as shown in Figure 1(d), where $f(\text{class})$ is a transformation function containing the class labels. Information theory based feature selection is one such class of algorithms, of which MIFS is a typical representative. The mutual information between attribute $A_m$ and the class labels in MIFS is equation (4):

$$I(A_m; Y) = \sum_{a \in A_m} \sum_{y \in Y} p(a, y)\, \log \frac{p(a, y)}{p(a)\, p(y)} \qquad (4)$$
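For reference, equation (4) can be estimated from data by discretizing the attribute with a histogram. The sketch below (names are ours) uses such a plug-in estimate; its fluctuation on a limited number of samples is exactly the weakness of MI discussed later in the experiments.

```python
import numpy as np

def mutual_information(a: np.ndarray, y: np.ndarray, bins: int = 16) -> float:
    """Histogram (plug-in) estimate of I(A_m; Y) from equation (4), in bits."""
    edges = np.histogram_bin_edges(a, bins=bins)
    a_bin = np.digitize(a, edges[1:-1])          # bin indices 0..bins-1
    mi = 0.0
    for v in np.unique(a_bin):
        p_x = np.mean(a_bin == v)                # marginal p(a)
        for c in (0, 1):
            p_xy = np.mean((a_bin == v) & (y == c))   # joint p(a, y)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * np.mean(y == c)))
    return mi
```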
MIFS makes use of class labels in an explicit way and focuses on mutual information to evaluate the relevance between attributes and classes and among attributes, but it under-expresses the "intra-class tight, inter-class sparse" property of the samples. In summary, this paper investigates a way to explicitly fuse class labels with attributes, organizes the fused data by clustering, discriminates the class-differentiation ability of attributes based on inter-cluster and intra-cluster distances, and proposes a feature selection algorithm based on class labels, SOM, and hierarchical clustering.

Feature selection for binary classification based on class labeling and clustering
The proposed algorithm is shown in Table 1; it evaluates all the attributes one by one. Because the guidance role of the class labels needs to be used explicitly, the algorithm first fuses the sample values with their class labels to generate new two-dimensional vector samples. Then, considering that attributes with significant classification performance have similar (close) attribute values for samples of the same class and large (far) differences in attribute values for samples of different classes, the algorithm uses SOM neural network clustering and hierarchical clustering to automatically organize the division of the two-dimensional vector samples. Finally, the evaluation function of attribute $A_m$ is calculated to assess the intra-cluster denseness, the inter-cluster sparseness, and the correct rate of the division.
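To fix ideas, the per-attribute loop of Table 1 can be outlined as follows. `map_class_labels`, `train_som`, `primary_clustering`, `hierarchical_merge`, and `j_csh` are placeholder names standing in for Algorithms 2-4 and Definition 5; hedged sketches of each are given in the corresponding sections below, so this is an outline rather than the paper's exact implementation.

```python
import numpy as np

def fs_csh(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Score every attribute one by one (Table 1); lower J_CSH is better.

    The helper functions stand in for Algorithms 2-4 and Definition 5,
    sketched in the sections that follow.
    """
    K, M = X.shape
    scores = np.empty(M)
    for m in range(M):
        V = map_class_labels(X[:, m], y)        # 2-D vectors fusing labels
        W = train_som(V)                        # Algorithm 2: ring-SOM weights
        clusters = primary_clustering(V, W, y)  # Algorithm 3: basic clusters
        c1, c2 = hierarchical_merge(clusters)   # Algorithm 4: merge down to 2
        scores[m] = j_csh(c1, c2)               # Definition 5: attribute score
    return scores
```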

Mapping class labels
In order to let the class labels play a guiding role in the feature selection process, the attribute values of the samples are combined with the class labels. For the attribute $A_m$, each sample value $a_{km}$ is combined with the class label $y_k$ of the sample to form a two-dimensional vector $[a_{km}, y_k]$. Note that $y_k$ in the vector $[a_{km}, y_k]$ is the class label of sample $X_k$, which cannot be directly involved in mathematical operations in the general sense. Therefore, the following transformation has to be made.
All the sample values of $A_m$ are normalized using equation (5); equation (6) then gives the normalized sample mean $\bar{a}_m$ of $A_m$.

Definition 1 (numerical label $y'_k$ of sample $X_k$): If the class label of sample $X_k$ is $y_k \in \{0, 1\}$, then the numerical label of sample $X_k$ is denoted $y'_k$. Equation (7) maps the class label $y_k \in \{0, 1\}$ of the sample to the numerical label $y'_k$: the modulus of $y'_k$ is normalized to the sample mean $\bar{a}_m$ of the attribute $A_m$, and the sign of $y'_k$ is determined by the class label. This converts the label information $y_k$, which has no numerical meaning, into a value that can participate in numerical operations while retaining the classification information. Finally, $[a'_{km}, y'_k]$ is mapped onto the unit circle using equations (8) and (9).
Through the above steps, two-dimensional vectors fusing the attribute values with the class information are obtained and used as inputs to the SOM network.
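The sketch below illustrates the spirit of this mapping under explicit assumptions: equation (5) is taken as min-max normalization, equation (7) as a signed value of magnitude $\bar{a}_m$, and equations (8)-(9) as scaling each vector onto the unit circle. The paper's exact equations (5)-(9) may differ; this is a stand-in, not a reproduction.

```python
import numpy as np

def map_class_labels(a: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fuse attribute values with class labels into 2-D unit-circle vectors.

    ASSUMPTIONS (equations (5)-(9) are not reproduced exactly):
    (5) min-max normalization; (7) numerical label y' = +/- mean(a');
    (8)-(9) projection of [a', y'] onto the unit circle.
    """
    a_norm = (a - a.min()) / (a.max() - a.min())      # assumed eq. (5)
    a_mean = a_norm.mean()                            # eq. (6): sample mean
    y_num = np.where(y == 1, a_mean, -a_mean)         # assumed eq. (7)
    v = np.column_stack([a_norm, y_num])
    return v / np.linalg.norm(v, axis=1, keepdims=True)  # assumed eqs. (8)-(9)
```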
Training the SOM network

The SOM neural network is different from other neural networks and also from other clustering algorithms. It is able to reproduce the topological relationships among the input patterns through the topological relationships of the output-layer neurons obtained from the mapping, and the spatial distribution of the trained network connection weight vectors reflects the statistical properties of the input patterns.[26] The general SOM neural network adopts a "planar" topology, which maps the high-dimensional input patterns onto output-layer neurons distributed on a two-dimensional plane through competition, cooperation, and adaptation steps, and reflects the aggregation characteristics of the input patterns by the distribution of the weight vectors of these neurons, forming clusters after clustering.[27] In this case the clusters are distributed on a two-dimensional plane, which complicates the subsequent hierarchical clustering. In fact, as reported in the literature, the one-dimensional SOM is significantly superior to the two-dimensional SOM in many respects, such as maintaining and achieving linear separability of classes, expressing the similarity between data, the clarity of inter-class location relationships, and the ease of visualizing class boundaries.[28] In addition, clustering unknown datasets with methods based on the one-dimensional SOM can overcome the problem of not knowing the structure of the dataset and hence not being able to choose the correct clustering method, avoid failures caused by an unsuitable method, and better discover the clustering and structural characteristics of an unknown dataset.[28] The one-dimensional SOM can thus serve as the basis for clustering any type of dataset. Therefore, in this paper a one-dimensional SOM is chosen and the output-layer neurons are organized in a ring topology, as shown in Figure 2. It has $l$ source nodes in the input layer and $n$ neurons in the output layer, with full connectivity between neurons and source nodes. The neurons in the output layer are topologized in the form of a "closed curve," where any neuron is directly connected to the neurons on either side of it, and the inhibition of the winning neuron on its neighboring neurons extends along both sides of the curve during the "weight coefficient update" phase of network training. The SOM network is then trained as shown in Table 2.
The winning neuron in the competition step of SOM training is denoted $I(A^*_{km})$, and $d_{n, I(A^*_{km})}$ is the lateral Euclidean distance in the output space between the winning neuron $I(A^*_{km})$ and neuron $n$. The Gaussian topological neighborhood around the winner is

$$h_{n, I(A^*_{km})}(ite) = \exp\left(-\frac{d^2_{n, I(A^*_{km})}}{2\sigma^2(ite)}\right),$$

where $\sigma(ite)$ is the "effective width" of the Gaussian topological domain, which shrinks over time and determines the shrinkage of the Gaussian topological domain as the number of iterations $ite$ increases:

$$\sigma(ite) = \sigma_0 \exp(-ite/\tau_1).$$

A constant $\tau_1$ is chosen to accelerate the rate of reduction of the Gaussian topological field. The learning rate $\eta$ starts from an initial value $\eta_0$ and decreases as the number of iterations $ite$ increases:

$$\eta(ite) = \eta_0 \exp(-ite/\tau_2),$$

where the constant $\tau_2$ is chosen to ensure the convergence accuracy of SOM training. The Gaussian topological field of the winning neuron is shown in Figure 3; after SOM training, the output-layer neurons are distributed along the closed curve.

Table 2. Training the SOM network.

Algorithm 2: Training the SOM network
Input: the two-dimensional vectors produced by the class label mapping
Output: a weight matrix W with N rows and 2 columns
1: (Initialization) Assign random values in [0, 1] to W, where N is the number of neurons in the output layer. Set the initial learning rate $\eta_0$, the initial effective width of the Gaussian topological field $\sigma_0 = [N/2]$, and the number of training iterations iteMax.
2: (Preparation) Present the two-dimensional vectors to the input layer.
3: for $ite = 1$ to $iteMax$ do
4:   for each input vector do
5:     (Competition) find the winning neuron, whose weight vector has the minimum Euclidean distance to the input
6:     for each output-layer neuron $n$ do
7:       (Adaptation) move $W_n$ toward the input in proportion to $\eta(ite)$ and the Gaussian neighborhood
       end for
8:   end for
9: end for
10: Return W

At the end of the network training, the neurons of the output layer are mapped to different locations in the ring topology under the influence of the two-dimensional input vectors, $k = 1, 2, \cdots, K$. They will serve as cluster centers, ready for the subsequent partitioning.
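A compact sketch of Algorithm 2 under the standard SOM update rule with a ring (circular) output topology follows. The exponential decay schedules and the reconstructed loop body are assumptions consistent with the description above; the default parameters mirror the experimental settings given later (N = 21, iteMax = 1000, $\eta_0$ = 0.2).

```python
import numpy as np

def train_som(V: np.ndarray, N: int = 21, ite_max: int = 1000,
              eta0: float = 0.2, seed: int = 0) -> np.ndarray:
    """Train a 1-D ring SOM on 2-D vectors V (Algorithm 2, reconstructed)."""
    rng = np.random.default_rng(seed)
    W = rng.random((N, 2))                    # step 1: random init in [0, 1]
    sigma0 = N / 2
    tau1 = ite_max / np.log(sigma0)           # assumed decay constants
    tau2 = float(ite_max)
    for ite in range(ite_max):
        sigma = sigma0 * np.exp(-ite / tau1)  # shrinking Gaussian width
        eta = eta0 * np.exp(-ite / tau2)      # decaying learning rate
        for v in V:
            winner = np.argmin(np.linalg.norm(W - v, axis=1))  # competition
            ring = np.arange(N)               # lateral distance on the ring
            d = np.minimum(np.abs(ring - winner), N - np.abs(ring - winner))
            h = np.exp(-d ** 2 / (2 * sigma ** 2))             # cooperation
            W += eta * h[:, None] * (v - W)                    # adaptation
    return W
```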

SOM primary clustering
In order to achieve binary clustering of the two-dimensional vectors formed from the sample attribute values and class labels, it is first necessary to perform a primary clustering that generates the basic clusters, as shown in Table 3. Since the neurons in the output layer of the SOM correspond to the weight vectors obtained from training, the output-layer neurons are represented by $W^T_n$, $n = 1, 2, \cdots, N$, and are used as cluster centers to construct the set of clusters. The distance from each two-dimensional vector to the cluster centers is calculated one by one, and the vector is assigned to the closest cluster center. If there is a primary cluster whose modulus equals 1, that primary cluster contains only its cluster center without any two-dimensional vectors; it is then removed from C, and the remaining elements of C are the basic clusters.
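A direct rendering of this assignment step (Algorithm 3), with names of our choosing; clusters that capture no vectors (modulus 1) are dropped, and each cluster keeps its center, its member vectors, and their original class labels for the later accuracy computation.

```python
import numpy as np

def primary_clustering(V: np.ndarray, W: np.ndarray, y: np.ndarray) -> list:
    """Assign each 2-D vector to its nearest SOM neuron (Algorithm 3).

    Returns the basic clusters as (center, members, labels) triples;
    clusters containing only their center are removed.
    """
    nearest = np.argmin(
        np.linalg.norm(V[:, None, :] - W[None, :, :], axis=2), axis=1)
    clusters = []
    for n in range(len(W)):
        idx = np.flatnonzero(nearest == n)
        if len(idx) > 0:                  # drop empty (modulus-1) clusters
            clusters.append((W[n], V[idx], y[idx]))
    return clusters
```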

Cohesive hierarchical clustering of basic clusters
On the basis of the primary clustering, cohesive (agglomerative) hierarchical clustering is performed on the basic clusters until only two clusters remain. At this point all two-dimensional vectors are divided into two clusters (i.e., different values of the same attribute are divided under the influence of the class labels), and the evaluation function $J_{CSH}(A_m)$ is calculated as the score of the attribute (i.e., the basis for selecting the attribute), as shown in Table 4.
In the hierarchical clustering, following the "circular" topology of the SOM output layer, the Euclidean distance between each output-layer neuron and its right-adjacent neuron is calculated in turn and stored in the distance set D. The basic clusters are then merged cyclically: there are $N^*$ basic clusters, so $N^* - 2$ merges are performed. Each time, the two closest clusters are selected from D and merged, after which the merged cluster center and the inter-cluster distances are updated. The center of a merged cluster is

$$W_{new} = \kappa_k W_k + \kappa_l W_l,$$

where $\kappa_k$ and $\kappa_l$ are the proportions of samples in the two clusters involved in the merge.
The set C of clusters is then collated and the distance set D updated by calculating the distances between the new cluster and its left and right neighbors. When the cyclic merging is complete, the evaluation function $J_{CSH}(A_m)$ is calculated for the attribute $A_m$.
$J_{CSH}(A_m)$ reflects the intra-cluster denseness, the inter-cluster sparseness, and the correctness of the division. Thus $J_{CSH}(A_m)$ is influenced by the intra-cluster distance, the inter-cluster distance, and the division accuracy, and the smaller its value, the stronger the category-differentiation ability of the corresponding attribute.
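A sketch of the merge phase just described (the loop of Algorithm 4), assuming the sample-weighted center update above; the bookkeeping names are ours, and clusters are taken in ring order as produced by the primary clustering.

```python
import numpy as np

def hierarchical_merge(clusters: list) -> list:
    """Agglomeratively merge ring-adjacent basic clusters down to two
    (Algorithm 4, merge phase). `clusters` holds (center, members, labels)
    triples ordered along the ring."""
    clusters = list(clusters)
    while len(clusters) > 2:
        # Distances between each cluster center and its right ring-neighbor.
        d = [np.linalg.norm(clusters[i][0] - clusters[(i + 1) % len(clusters)][0])
             for i in range(len(clusters))]
        i = int(np.argmin(d))                    # closest adjacent pair
        j = (i + 1) % len(clusters)
        (wi, vi, yi), (wj, vj, yj) = clusters[i], clusters[j]
        ki, kj = len(vi), len(vj)
        w_new = (ki * wi + kj * wj) / (ki + kj)  # sample-weighted new center
        merged = (w_new, np.vstack([vi, vj]), np.concatenate([yi, yj]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.insert(min(i, j), merged)
    return clusters
```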
Definition 2 (intra-cluster distance): The arithmetic mean of the Euclidean distances over all pairwise combinations $(X_i, X_j)$ of sample points in the cluster $C_k$ is the intra-cluster distance of $C_k$, denoted $d_{in}(C_k)$:

$$d_{in}(C_k) = \frac{1}{C^2_{|C_k|}} \sum_{X_i, X_j \in C_k,\; i < j} \lVert X_i - X_j \rVert,$$

where $|C_k|$ is the modulus of the cluster $C_k$ (i.e., the number of samples in $C_k$) and $C^2_{|C_k|}$ is the number of combinations of any two of the $|C_k|$ samples.

Definition 3 (inter-cluster distance): The Euclidean distance between the centers of the clusters $C_k$ and $C_l$ is the inter-cluster distance, denoted $d_{ex}(C_k, C_l)$:

$$d_{ex}(C_k, C_l) = \lVert W_k - W_l \rVert,$$

where $W_k$ and $W_l$ are the centers of the clusters $C_k$ and $C_l$, respectively.
Definition 4 (division accuracy): Two clusters $C_1$ and $C_2$ are obtained from the basic clusters $C_n$, $n = 1, 2, \cdots, N^*$, after the cohesive hierarchical clustering; the division accuracy is denoted $Accuracy(C_1, C_2)$:

$$Accuracy(C_1, C_2) = \frac{tp + tn}{tp + fp + tn + fn},$$

where $tp$ refers to the number of positive-class samples judged as positive, $fp$ the number of negative-class samples judged as positive, $tn$ the number of negative-class samples judged as negative, and $fn$ the number of positive-class samples judged as negative.
Definition 5 (evaluation function): The evaluation function for the attribute $A_m$ $(m = 1, 2, \cdots, M)$ is denoted $J_{CSH}(A_m)$; it combines the intra-cluster distances, the inter-cluster distance, and the division accuracy of the final two clusters.

Time complexity analysis of the algorithm

In terms of time complexity, first, the number of attributes $M$ affects the time overhead, since the algorithm calculates the evaluation function for all attributes one by one. Second, when examining any single attribute, the training of the SOM network takes up most of the time compared with the primary and hierarchical clustering, and its overhead is mainly determined by the number of iterations iteMax. Third, in each iteration of SOM training, the Euclidean distances between the $K$ two-dimensional vectors and the $N$ output-layer neurons are calculated. Therefore, the time complexity of the proposed algorithm can be estimated as $O(M \cdot iteMax \cdot K \cdot N)$. Obviously, once $N$ and iteMax are determined (i.e., the number of neurons $N$ in the output layer of the SOM network is fixed, and the number of iterations iteMax that ensures convergence of the SOM weights is determined), the time complexity is mainly affected by the size of the dataset and can be estimated as $O(M \cdot K)$.
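Gathering Definitions 2-5 in code form: `d_in`, `d_ex`, and `accuracy` follow the definitions above directly, while the closed form of $J_{CSH}$ is not reproduced in the text above, so `j_csh` below is only one plausible combination (smaller is better), an assumed stand-in rather than the paper's equation.

```python
import numpy as np
from itertools import combinations

def d_in(members: np.ndarray) -> float:
    """Definition 2: mean pairwise Euclidean distance within a cluster."""
    pairs = list(combinations(range(len(members)), 2))
    if not pairs:
        return 0.0
    return float(np.mean([np.linalg.norm(members[i] - members[j])
                          for i, j in pairs]))

def d_ex(w_k: np.ndarray, w_l: np.ndarray) -> float:
    """Definition 3: Euclidean distance between cluster centers."""
    return float(np.linalg.norm(w_k - w_l))

def accuracy(y1: np.ndarray, y2: np.ndarray) -> float:
    """Definition 4: division accuracy; taking the better of the two
    cluster-to-class orientations is an assumption of this sketch."""
    tp, fn = np.sum(y1 == 1), np.sum(y2 == 1)
    fp, tn = np.sum(y1 == 0), np.sum(y2 == 0)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return max(acc, 1 - acc)

def j_csh(c1, c2) -> float:
    """Definition 5, ASSUMED form only: small when clusters are tight,
    far apart, and accurately divided (not the paper's exact equation)."""
    (w1, v1, y1), (w2, v2, y2) = c1, c2
    return (d_in(v1) + d_in(v2)) / (d_ex(w1, w2) * accuracy(y1, y2) + 1e-12)
```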

Experimental analysis and discussion
Software resources used for the experiments include Python 3.8.11 (https://www.python.org/) and Spyder IDE 5.1.5 (https://www.spyder-ide.org/), which provide the scripting-language environment and the integrated development environment, respectively. The hardware platform consists mainly of an Intel Core i7-10750H 2.60 GHz processor and 8 GB of 2933 MHz memory.
To test the algorithm proposed in this paper, artificial data are first used to verify its ability to distinguish attributes and its robustness. Second, on real data from different sources, the algorithm of this paper is compared with classical mainstream algorithms (such as MIFS,[20] FS,[9] LS,[10] and ReliefF[12]) for feature selection, and the performance after feature selection is compared across LR (Logistic Regression), K-NN (K-Nearest Neighbor), DT (Decision Tree), and SVM (Support Vector Machine) classifiers to evaluate the practicality of the FS-CSH algorithm. For LR, the regularization penalty is "l2," the loss-function optimizer is "liblinear," the residual convergence tolerance is $10^{-4}$, and the maximum number of iterations is 100. For K-NN, the number of nearest neighbors is 5, voting samples are weighted equally, and the leaf size is 30. DT uses the "gini" split criterion, with a threshold of 2 samples for stopping node splitting. SVM uses the RBF kernel, a stopping tolerance of $10^{-3}$, and heuristic shrinking.
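These settings map directly onto scikit-learn constructor parameters; the following sketch records one reading of them (most coincide with the library defaults; the paper does not list its exact constructor calls).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# One reading of the stated settings in scikit-learn terms.
classifiers = {
    "LR": LogisticRegression(penalty="l2", solver="liblinear",
                             tol=1e-4, max_iter=100),
    "K-NN": KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                 leaf_size=30),
    "DT": DecisionTreeClassifier(criterion="gini", min_samples_split=2),
    "SVM": SVC(kernel="rbf", tol=1e-3, shrinking=True),
}
```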
The relevant parameters of the SOM network in the experiments are set as follows: number of input units $SOM_{input} = 2$, number of output units $N = 21$, number of training iterations $iteMax = 1000$, and initial learning rate $\eta_0 = 0.2$.

Artificial data sets
Two-dimensional uniformly distributed data. Consider the classification problem in the two-dimensional feature space of Figure 4.[20,29] The attribute vector of the sample, $(X, Y)$, is uniformly distributed in $[0, 1] \times [0, 1]$. When the X attribute value of a sample satisfies $x < a$ and its Y attribute value satisfies $y < b$, where $b = 1/(2a)$, the sample belongs to class 1; otherwise it belongs to class 2. In Figure 4, when $0.5 < a < \sqrt{2}/2$, X has stronger discriminative power than Y, and when $\sqrt{2}/2 < a < 1.0$, X is weaker than Y. As the literature notes, good attributes can be selected before the learning process starts, and this selection does not depend on the details of the learning algorithm (including its initial weights and convergence).[20] The normalized Fisher linear discriminant vector (FLD),[20] the mutual information (MI),[20] and the evaluation function values J_CSH of the proposed algorithm are shown separately in Figure 5, which reflects the trend of attribute-to-class differentiation ability as $a$ increases from 0.5 to 1.0 in steps of 0.01. Evidently, J_CSH is able to distinguish attributes with class-differentiation capability, just like FLD and MI.

Two-dimensional Gaussian mixture probability density data. The robustness of the algorithm was verified by choosing different Gaussian probability densities for the binary classification of two-dimensional Gaussian mixture data.[20] Consider a two-dimensional sample space containing two classes, with samples described by the attribute vector $(X, Y)$. The X-dimensional component of class 1 obeys a Gaussian distribution with zero mean, and it gradually elongates as the standard deviation $\sigma_{1x}$ increases across the tests. The X-dimensional component of class 2, by contrast, obeys a Gaussian distribution with fixed mean and standard deviation ($\mu = 0.5$, $\sigma_{2x} = 0.1$); relative to class 1, class 2 is shifted as a whole by 0.5 in the X direction. Both classes obey a Gaussian distribution with zero mean and 0.1 standard deviation in the Y direction. The probability densities follow the one-dimensional Gaussian distribution.
The Gaussian mixture probability densities of class 1 and class 2 are expressed accordingly. To simulate a real classification task, a series of tests with increasing $\sigma_{1x}$ was performed (the value of $\sigma_{1x}$ gradually increased from 0.1 to 6.4). 1000 samples were drawn with equal probability from the two distributions, and Figure 6 shows the distribution of the samples as $\sigma_{1x}$ increases from 0.1 to 0.8; the Y-dimensional component of the samples in Figure 6 remains unchanged. The resulting values of J_CSH, MI, and the FLD components $W_x$ and $W_y$ are shown in Table 5. In the table, both the attribute evaluation function J_CSH of the algorithm in this paper and the mutual information MI can accurately distinguish the attribute X that is valid for classification, while FLD misses in some cases. Specifically, the value of the Y-dimensional attribute evaluation function $J_{CSH}(y, class)$ remains constant, and the value of the X-dimensional component $J_{CSH}(x, class)$ is small and close to half of $J_{CSH}(y, class)$; through this, J_CSH can identify the attribute X that is favorable for classification. For comparison, the mutual information of the Y-dimensional attribute, $I(y, class)$, is likewise constant and close to 0, and the mutual information of the X-dimensional attribute, $I(x, class)$, is larger than $I(y, class)$; through this, MI can also identify attribute X. However, when the two categories are clearly distinguishable ($\sigma_{1x} = 0.1$), the mutual information of the X-dimensional component, $I(x, class)$, is close to 1; as class 1 expands and gradually covers class 2, $I(x, class)$ tends to decrease; when $\sigma_{1x}$ takes a larger value, $I(x, class)$ increases again, which is the case in the last row of the table, when class 2 has completely fallen into class 1 in the X direction and its probability density function covers a much smaller area than class 1. This fluctuation is caused by estimating the probability density function from the statistical frequencies of a limited number of samples in discrete bins. In addition, the results of Fisher's linear discriminant analysis are presented in the last two columns of Table 5, where the two FLD components lose the ability to discriminate between the two attributes when $\sigma_{1x}$ takes larger values. For example, at $\sigma_{1x} = 3.2$, FLD indicates that the Y-dimensional attribute carries more categorical information, which is clearly inconsistent with reality. This is because, as class 1 continues to expand along the X direction, even though the mean difference between the two classes in the Y direction is 0, estimating the means from a finite number of samples yields a small random Y component that leads to incorrect indications. Moreover, if all classes have the same mean, a linear discriminant function cannot be constructed; when the inter-class distance measured by the mean difference is small, equal means lead to serious estimation problems, and if the inter-class distances are small relative to the standard deviations of the classes, random fluctuations in the results occur. By contrast, the algorithm in this paper is strongly robust due to the explicit class-label transformation: it exploits the stability of the two-dimensional vectors $(x, class)$ and $(y, class)$ distributed in space, as shown in Figure 7, avoiding the deficiency of mutual information MI in estimating probability densities from a limited number of samples and avoiding the sensitivity of FLD to the estimated means and standard deviations, while taking into account the supervisory role of the class labels when examining the intra-class tightness and inter-class sparseness properties of the attributes.
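A sketch of how such a test set can be generated under the stated distributions; the sample count and the equal-probability class draw follow the text, while variable names are ours.

```python
import numpy as np

def gaussian_mixture_data(sigma_1x: float, n: int = 1000, seed: int = 0):
    """Draw n samples with equal probability from the two classes:
    class 1: X ~ N(0, sigma_1x^2);  class 2: X ~ N(0.5, 0.1^2);
    both classes: Y ~ N(0, 0.1^2)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)                  # 0 -> class 1, 1 -> class 2
    x = np.where(y == 0,
                 rng.normal(0.0, sigma_1x, n),
                 rng.normal(0.5, 0.1, n))
    return np.column_stack([x, rng.normal(0.0, 0.1, n)]), y
```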

Real data sets
In the experiments, five binary classification datasets widely used for validating feature selection algorithms and classifier performance are used. They belong to different domains, have progressively increasing numbers of attributes, and can be found in the UCI repository (http://archive.ics.uci.edu/ml/index.php). These datasets are: Pima Indian Diabetes, Wisconsin Breast Cancer Database,[30] MUSK "Clean1" Database,[31] LSVT Voice Rehabilitation Dataset,[32] and Olivetti Faces Database, as detailed in Table 6.
Feature selection algorithms of the same type, MIFS, FS, LS, and ReliefF, were selected for comparison. For MIFS, FS, and ReliefF, higher-scoring attributes are more important, and their scores are min-max normalized; for the algorithm in this paper and for LS, lower-scoring attributes are more important, and their scores are min-max normalized after taking the reciprocal. To evaluate the classification accuracy of the classifiers after feature selection, 80% of the data are used for training and 20% for testing.

Pima Indian Diabetes dataset. Pima Indian Diabetes was used to study the classification problem between the class label "diabetic or not" and eight attributes. It contains 768 records of Pima Indian female patients over 21 years of age, of whom 268 have diabetes and 500 do not. The proposed algorithm was used to calculate the J_CSH values of the eight attributes, shown in Figure 8. Ranked from strongest to weakest (i.e., in ascending order of J_CSH), the attributes were: 2 Glucose, 6 BMI, 3 Blood Pressure, 7 Diabetes Pedigree Function, 8 Age, 1 Pregnancies, 4 Skin Thickness, and 5 Insulin. The top-ranked attributes were selected as the result of feature selection. The two attributes with the lowest J_CSH values were 2 Glucose and 6 BMI, which is consistent with the conclusion stated by Chen et al.[33] and in line with the opinion of Gong et al.[34] that the most dominant attributes in Pima Indian Diabetes are Glucose and BMI. The scores of MIFS, FS, LS, ReliefF, and the algorithm in this paper for all attributes are shown in Figure 9. MIFS and FS can select the attributes Glucose and BMI that have a major impact on the classification, while ReliefF and LS deviate. The algorithm in this paper, FS-CSH, is not only able to select the important attributes that are effective for classification, but also indicates the attributes' category-differentiation ability more clearly.
Finally, the impact of the above algorithms on the classification accuracy was evaluated using the LR and SVM classifiers. The average accuracy over 20 runs of 10-fold cross-validation on the training data is shown in Figure 10; the average accuracy over 20 classification runs on the test data is shown in Figure 11. The attributes selected by the FS-CSH algorithm of this paper can effectively improve the classification accuracy, and its ranking of attributes is generally better than those of the other feature selection algorithms when different numbers of attributes are required.

Wisconsin Breast Cancer dataset. Figure 13(a) shows the accuracy of 10-fold cross-validation for LR on the 80% training data; the proposed algorithm outperforms the MIFS, ReliefF, and FS algorithms overall and is very close to the LS algorithm. Figure 14(a) shows the accuracy of LR on the 20% test data; only for 2 attributes does the FS-CSH algorithm give 92.1% accuracy, slightly lower than the MIFS, FS, and LS algorithms but still higher than the ReliefF algorithm. The accuracy of the FS-CSH algorithm is either second or tied for first place with the other algorithms for 1, 3, 4, 5, and 6 attributes.
Figure 13(b) shows the classification accuracy of 10-fold cross-validation for SVM on the 80% training data. When the number of attributes is 1, the accuracy of the proposed algorithm is the same as FS, higher than ReliefF, and slightly lower than MIFS and LS; with 2 attributes, the accuracy of FS-CSH is the same as MIFS; with 3 and 4 attributes, the accuracy of FS-CSH is the same as FS and lower only than LS; with 5 attributes, the accuracy of FS-CSH is the highest; with 6 attributes, FS-CSH has the same accuracy as LS and is lower only than the ReliefF algorithm. Figure 14(b) shows the accuracy of SVM on the 20% test data. With 1 attribute, the accuracy of FS-CSH is the same as LS, higher than FS and ReliefF, and slightly lower than MIFS; with 2 attributes, the accuracy of FS-CSH is slightly lower than MIFS, FS, and LS, but still stronger than ReliefF; with 3 and 4 attributes, the accuracy of FS-CSH is the same as FS and ReliefF, and the highest; with 5 and 6 attributes, the accuracy of FS-CSH still reaches 91.2%. The above analysis shows that the proposed algorithm is effective on the Wisconsin Breast Cancer dataset and can filter out the subset of features that are beneficial for classification.

MUSK "Clean1" dataset. The Clean1 part of the MUSK dataset contains 476 molecular samples described by 166 attributes; 207 samples are labeled "musk" and 269 "non-musk." The algorithm of this paper is used to calculate the evaluation function values J_CSH of the 166 attributes, as shown in Figure 15. Considering that the MUSK dataset is a structural description of microscopic molecules that are not linearly separable, the classification accuracy of the selected feature subsets is examined on K-NN, DT, and SVM, as shown in Table 7. The accuracy of the proposed algorithm is lower than MIFS and ReliefF only for the 10-fold cross-validation of the 80% training data on DT, and higher than the other algorithms in the remaining cases. It can be concluded that the proposed algorithm FS-CSH is effective on the MUSK "Clean1" set and can filter a subset of features favorable for classification when the number of attributes is 10% of the total.

LSVT Voice Rehabilitation dataset. LSVT Voice Rehabilitation was created by Athanasios Tsanas of the University of Oxford, who obtained clinical information on 14 patients from the voice signals provided by LSVT Global. These patients were diagnosed with Parkinson's disease and were receiving LSVT-assisted voice rehabilitation. The set characterizes 126 speech signals using 309 algorithms; that is, it has 126 speech-signal samples, each described by 309 attributes and labeled "acceptable" or "unacceptable." Due to the peculiarities of the data distribution of each attribute in this dataset, operations such as feature selection, feature extraction, and classifier performance validation are more challenging on it. In this experiment, a two-step preprocessing was performed on this set: first, attributes with small variances were eliminated (the unbiased variance estimate $s^2$ is calculated for each attribute, and attributes with $s^2 < 0.05$ are discarded directly); second, the remaining attributes were screened further. The evaluation function values J_CSH of the remaining attributes were calculated using the algorithm of this paper, and the J_CSH values of the excluded attributes were uniformly set to 0.2, as shown in Figure 16. The two attributes with the lowest J_CSH values were 153 entropy_log_4_coef and 84 MFCC_0th coef. The comparisons with MIFS, FS, LS, and ReliefF on LR, K-NN, DT, and SVM are shown in Figures 17 and 18.
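The first preprocessing step can be rendered directly in code; the 0.05 threshold is from the text, while the function name and the bookkeeping of kept columns are ours (the text sets the excluded attributes' J_CSH to 0.2, so the kept-column mask is worth returning).

```python
import numpy as np

def drop_low_variance(X: np.ndarray, threshold: float = 0.05):
    """Step 1 of the LSVT preprocessing: discard attributes whose unbiased
    variance estimate s^2 falls below the threshold."""
    s2 = np.var(X, axis=0, ddof=1)     # unbiased estimate, as in the text
    keep = s2 >= threshold
    return X[:, keep], keep
```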
Figure 17 shows the accuracy of 10-fold cross-validation on the 80% training data for the four classifiers. With 2 attributes, the accuracy of the proposed algorithm is the highest on LR and DT, inferior to MIFS, FS, and ReliefF on K-NN, and equal to FS and ReliefF and slightly inferior to MIFS on SVM. When the number of attributes is greater than 2, the accuracy of the proposed algorithm is comparable to the other algorithms, inferior only to MIFS, FS, and ReliefF on K-NN.
Figure 18 shows the accuracy on the 20% test data for the four classifiers. With 1 attribute, the accuracy of the proposed algorithm is only slightly lower than the MIFS algorithm on DT and higher than or equal to the rest of the algorithms. For more than 2 attributes, the proposed algorithm shows accuracy as high as MIFS, FS, and ReliefF on K-NN and SVM, better than LS on LR, and indistinguishable from MIFS, FS, and ReliefF on DT. In summary, for the high-dimensional, complex LSVT Voice Rehabilitation dataset, the FS-CSH algorithm still shows trustworthy screening ability, and the feature subset it selects effectively improves the classification accuracy of the classifiers.
Olivetti Faces dataset. The Olivetti Faces set contains 400 face images from 40 subjects, with 10 images each. Each image is 64 × 64 pixels, described as a 4096-dimensional vector, and each pixel has 256 gray levels. For this set, determining whether a person is wearing glasses is a typical binary classification problem. Before the experiment, the face image data were histogram-equalized, and each image was assigned a label of 0 or 1 according to whether the face in the image wears glasses, with 1 for glasses and 0 for none.
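One way to reproduce this setup is sketched below; the glasses labels must be assigned by hand, so `glasses_ids` is a hypothetical placeholder standing in for the manual labeling of the 119 glasses-wearing images mentioned later.

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from skimage.exposure import equalize_hist

faces = fetch_olivetti_faces()              # 400 images, 64 x 64 pixels
# Histogram-equalize each image and flatten it to a 4096-D vector.
X = np.stack([equalize_hist(img).ravel() for img in faces.images])

# Hypothetical placeholder: indices of the images whose subjects wear
# glasses; in the experiment these were labeled by hand.
glasses_ids = set()                          # ... fill in manually
y = np.array([1 if i in glasses_ids else 0 for i in range(len(X))])
```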
The evaluation function values J_CSH of the 4096 attributes (pixels) are calculated using the algorithm of this paper, as shown in Figure 19. Then the scores of the MIFS, FS, LS, and ReliefF algorithms are calculated for each attribute (pixel). The top 32 to 256 attributes (pixels) of each algorithm are selected to construct new feature subsets, and the classification accuracy of the selected feature subsets is checked with SVM.
In Figure 20(a), the average accuracy of the proposed algorithm is slightly lower than the MIFS algorithm when the number of attributes (pixels) is 32 and 64, on par with FS and ReliefF, and higher than the LS algorithm; in the remaining cases, the average accuracy of the algorithm proposed in this paper is comparable to that of MIFS and higher than the other similar algorithms. In Figure 20(b), as the number of attributes (pixels) grows from 32 to 296, the average accuracies of FS-CSH, MIFS, and ReliefF alternately lead: FS-CSH leads four times, MIFS once, and ReliefF three times, and all three have higher average accuracy than the FS and LS algorithms.
To visualize the effect of the proposed algorithm on the selection of face pixels, Figure 21(a) shows the 119 images of faces wearing glasses among the 400 face images, and Figure 21(b) shows the distribution of the selected pixels (marked by black dots) in these face images. Comparing Figure 21(a) and (b), the pixels selected by the proposed algorithm are mainly concentrated in the cheek area below the eyes while avoiding the spectacle-frame area. The former is influenced by the refractive effect of the eyeglass lenses, which makes this region more useful for distinguishing whether glasses are worn; the latter is influenced by the lateral shadows of the eyebrows and the bridge of the nose, which makes it ineffective for that purpose. It can therefore be concluded that the proposed feature selection algorithm FS-CSH is effective on the Olivetti Faces set and can filter out a subset of features favorable for classification, being slightly superior to the MIFS and ReliefF algorithms that also utilize classification information.

Conclusion
In this paper, a feature selection algorithm, FS-CSH, is proposed for the binary classification problem. The algorithm explicitly incorporates class labels, applies SOM and hierarchical clustering, scores each attribute, evaluates its relevance to the category, and filters the feature subset with the greatest classification capability, improving the classification accuracy on the reduced-dimension dataset. Simulations on both artificial and real data confirm these results. Compared with feature selection algorithms of the same type (i.e., algorithms that score each attribute individually), the algorithm proposed in this paper is a heuristic with clear physical meaning, simple computation, effectiveness, and robustness.

Figure 1. (a) Strong two-region distribution of $A_m$ values, (b) weak two-region distribution of $A_m$ values, (c) mixed distribution of $A_m$ values, and (d) mixed distribution of $A_m$ values with explicit class labels.

Algorithm 1: Feature selection based on class labeling, SOM, and hierarchical clustering (Table 1)
1: Load the samples
2: for each $A_m \in A$ do
3:   map class labels
4:   train the SOM network by Algorithm 2
5:   SOM primary clustering by Algorithm 3
6:   hierarchical clustering of the basic clusters and calculation of $J_{CSH}(A_m)$ by Algorithm 4
7: end for
8: Return $J_{CSH} = [J_{CSH}(A_1), J_{CSH}(A_2), \cdots, J_{CSH}(A_M)]$


Figure 7. (a) $(x, class)$ space distribution for different $\sigma_{1x}$ values and (b) $(y, class)$ space distribution for different $\sigma_{1x}$ values.

Figure 8. J_CSH values of the features for Pima.

Figure 9. (a) J_CSH, (b) MI, (c) FS, (d) LS, and (e) ReliefF values of the features for Pima.

Figure 15. J_CSH values of the features for MUSK.

Figure 16. J_CSH values of the features for the LSVT Voice Rehabilitation dataset.

Figure 19. J_CSH values of the features for Olivetti Faces.

Figure 21. (a) Pictures of people wearing glasses and (b) selected pixels on pictures of people wearing glasses.

Table 1. Feature selection algorithm for binary classification based on class labeling, SOM, and hierarchical clustering.

Table 3. Primary clustering based on SOM. Output: a set C containing $N^*$ basic clusters.

Table 4. Cohesive hierarchical clustering of basic clusters. Input: the set C containing $N^*$ basic clusters; output: an evaluation function value $J_{CSH}(A_m)$. Step 1 (calculating distances): following the circular topological relationship of the output layer, calculate the Euclidean distance between each output-layer neuron and its right-adjacent neuron and store it in D.

Table 5. Comparison of J_CSH, MI, and FLD.

Table 6. Information about the datasets.

Table 7. Average classification accuracy of K-NN, DT, and SVM. The bold entries in each row indicate the maximum average classification accuracy.