The Information Filtering of Gene Network for Chronic Diseases: Social Network Perspective

Web and mobile platforms have created an environment for technical cooperation through technological development and the diffusion of related devices. Large-scale data sets have become available for analyzing web interaction. In particular, large-scale data allow us to discover new patterns and insights in several research fields. In healthcare, most chronic diseases are caused by environmental and genetic factors (Van der Laan et al., 2003). The relationship between environmental exposure and genetic factors is crucial to disease etiology (Swift et al., 2004). For example, tobacco is considered one of the biggest environmental factors, responsible for many diseases each year. Schwartz and Collins (2007) discussed the importance of the correlation between genetic and environmental factors in human diseases. Thomas (2010) reviewed different approaches to gene-environment association studies that attempt to explain some of the most complex diseases. Although previous studies have examined chronic diseases and their causes one by one, they do not show the integrated relationships between various diseases and their related human genes. Therefore, this study investigates the gene-disease relationships affected by tobacco and identifies new association links using social network analysis and other mining techniques.


Introduction
Web and mobile platforms have created an environment for technical cooperation through technological development and the diffusion of related devices. With text mining, large-scale data sets have become available for analyzing web interaction. In particular, large-scale data allow us to discover new patterns and insights in several research fields. In healthcare, most chronic diseases are caused by environmental and genetic factors [1]. Investigating the causal relationships between environmental factors and human genes is important for understanding disease etiology [2,3]. For example, tobacco is considered one of the biggest environmental factors, responsible for many diseases each year. Schwartz and Collins (2007), writing in Science, discussed the importance of the correlation between genetic and environmental factors in human diseases. In 2010, a review covered different approaches to gene-environment association studies that attempt to explain some of the most complex diseases [4]. Although previous studies have examined chronic diseases and their causes one by one, they do not focus on the integrated relationships between various diseases and their related human genes. Therefore, this study analyzed the gene-disease relationships for tobacco and sought new causal association links using social network analysis and mining techniques.

Chronic Diseases and Data Mining.
Previous studies have investigated chronic diseases with data mining [5]. To find out how to decrease the risk of chronic diseases, many works have used data mining techniques such as case-based reasoning and machine learning [5,6]. According to previous studies, data mining can help improve diagnosis systems and the treatment of chronic diseases [6,7], and it helps professionals improve treatment for patients with chronic diseases. Previous studies have also investigated the effectiveness of data mining for improving clinical data repositories and diagnosis systems [6,8]. Although data mining techniques are powerful for treatment and diagnosis systems, no study has addressed the diagnosis of chronic diseases with social network analysis. This study therefore applied social network analysis together with collaborative filtering to examine causal relationships and correlations between genes and chronic diseases.

Social Network Analysis.
Social network analysis (SNA) is the analysis of social relationships based on network theory. A social network consists of nodes (i.e., the individual entities within the relationships) and ties, which represent relationships between the nodes, such as human relationships. The social network is a simple and powerful concept in that it can reveal the types of interaction or connection among users or entities [9]. This aspect constitutes various social phases with familiar or chance subjects. A social network can provide numerous advantages by reinforcing the connections between individual nodes or across the network as a whole. For these reasons, the social network concept has been applied in many fields, especially in information systems and data analytics [10]. For instance, social websites such as Facebook, LinkedIn, and Twitter strive to provide specific and diverse social services by determining the relationships among their users.
In the health care area, social network analysis helps researchers understand the relationships between diseases and human genes based on the overall structure of the disease-gene network [11,12]. This is important because it helps us understand the network structures of diseases and their associated human genes and identify the characteristics of the most influential human genes in the early phase of a disease. The network structure makes a large amount of information about disease-gene relationships accessible [13] and can also improve medical treatment and disease prevention by pointing to the primary causes of diseases. In particular, social network analysis of disease-gene networks could improve understanding of how the network coheres across many patterns and could improve the performance of prevention [14].

Collaborative Filtering.
Collaborative filtering (CF) is called a social information filtering technique because it uses relationships among entities to decide whether an item is likely to be linked to certain other items [14-17]. CF is a prediction algorithm used primarily in recommender systems [18,19]. We used collaborative filtering because this approach identifies genes related to particular diseases in the collected data.
Collaborative filtering recommends items to a user based on the similarity of that user's preferences to the preference or taste information of many other users [20-22]. Given these characteristics, this study applied a CF algorithm to the human gene network. SVD (singular value decomposition for collaborative filtering) is a matrix factorization (MF) model for solving the collaborative filtering problem [23,24]. SVD maps both users and items to a joint latent factor space of dimensionality $k$. The latent space is used to find similar products or services by comparing users and product information (i.e., descriptions and features of products). SVD assumes that only a small number of factors influence preferences and that a user's preference for each item is determined by how each factor relates to the user and to the item. This can be formulated as an MF problem. Namely, in a $k$-factor model, given the preference matrix $R \in \mathbb{R}^{n \times m}$ (which can be converted to a 2-mode network straightforwardly), SVD finds two matrices $P \in \mathbb{R}^{n \times k}$ and $Q \in \mathbb{R}^{m \times k}$ such that
$$R \approx P Q^{T}.$$
To find $P$ and $Q$, SVD solves the following optimization problem using stochastic gradient descent:
$$\min_{P,Q} \sum_{(u,i) \in \kappa} \left[ \left( r_{ui} - p_u^{T} q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right],$$
where $\lambda > 0$ is the regularization parameter that controls overfitting and $\kappa = \{(u,i) \mid r_{ui} > 0\}$. Here $r_{ui}$ is the entry of $R$ for user $u$ and item $i$, and $p_u$ and $q_i$ are the corresponding rows of $P$ and $Q$. More details are in [15,25]. The factorization is computed iteratively, starting from random initial values for $P$ and $Q$.
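The factorization described above can be illustrated with a minimal sketch, assuming a plain stochastic gradient descent update on the observed entries only. The function names (`factorize`, `rmse`), the learning rate, the number of epochs, and the toy preference values are all assumptions made for illustration; they are not the settings or the software used in this study.

```python
import random


def rmse(P, Q, observed):
    """Root-mean-square error of P Q^T on the observed (row, col, value) triples."""
    squared = [
        (r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in observed
    ]
    return (sum(squared) / len(squared)) ** 0.5


def factorize(observed, n_rows, n_cols, k=2, lam=0.02, lr=0.01, epochs=1000, seed=0):
    """Fit a k-factor model R ~ P Q^T by SGD on the observed entries.

    observed: list of (row, col, value) triples with value > 0.
    lam is the regularization parameter; lr and epochs are assumed values.
    """
    rng = random.Random(seed)
    # Random initial values for P and Q, as in the iterative method.
    P = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_cols)]
    samples = list(observed)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus the L2 penalty.
                P[u][f] += lr * (err * qi - lam * pu)
                Q[i][f] += lr * (err * pu - lam * qi)
    return P, Q


# Toy preference matrix (hypothetical values): rows could be diseases,
# columns genes; missing pairs are simply not listed.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(toy, n_rows=3, n_cols=3)
print("training RMSE:", rmse(P, Q, toy))
```

Unobserved pairs can then be scored with the inner product of the corresponding rows of `P` and `Q`, which is how a factor model of this kind produces recommendations.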

Data Collection and Methodology.
We collected terms of the human gene semantic type together with disease names related to tobacco, based on their co-occurrence in PubMed abstracts. To gather the raw data, we used the term tobacco as a query and collected 82,538 abstracts in XML [25,26]. Using a Java program, we parsed the abstracts and created a text file in the form PubMed ID/Title/Journal/Abstract/Year. In the next step, we extracted the bio-entities from the text file, built a MySQL database in the form Extracted Term/UMLS Semantic Type (ST)/CUI (the UMLS Concept Unique Identifier), and preprocessed the list of diseases. To unify the terms, we looked up each CUI in the UMLS and replaced the extracted term with its preferred term. We then matched the preferred terms to Gene Ontology, counted only the cases in which a disease and a gene co-occurred (disease-gene: count pair), and built an undirected network.
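The co-occurrence counting step can be sketched as follows, assuming each abstract has already been reduced to a set of disease terms and a set of gene terms. The function name `cooccurrence_counts` and the placeholder term names are hypothetical; the study's own pipeline used a Java parser and a MySQL database, which this sketch does not reproduce.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(annotated_abstracts):
    """Count how many abstracts mention each (disease, gene) pair together.

    annotated_abstracts: iterable of (disease_terms, gene_terms) pairs,
    one pair per abstract, with terms already mapped to preferred terms.
    Returns a Counter keyed by (disease, gene); each key is one weighted,
    undirected edge of the 2-mode disease-gene network.
    """
    counts = Counter()
    for diseases, genes in annotated_abstracts:
        # Use sets so a pair is counted at most once per abstract.
        for pair in product(set(diseases), set(genes)):
            counts[pair] += 1
    return counts


# Placeholder annotations for three abstracts (not real data).
abstracts = [
    ({"disease_A"}, {"gene_X", "gene_Y"}),
    ({"disease_A", "disease_B"}, {"gene_X"}),
    ({"disease_B"}, set()),  # no gene mentioned, so no edge is produced
]
edges = cooccurrence_counts(abstracts)
for (disease, gene), weight in sorted(edges.items()):
    print(disease, gene, weight)
```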
Following this procedure, we obtained a network containing 479 disease nodes, 869 gene nodes, and 2195 edges. After obtaining the weighted undirected network, we had to clarify its structure. We used Pearson's correlation to convert the heterogeneous network into a homogeneous network: the disease-gene 2-mode network was converted into a disease 1-mode network so that social network analysis could be applied. Pearson's correlation was computed to find the correlation among diseases; the larger the correlation score, the higher the similarity between two diseases. Based on the evaluation of the results, this study used SNA (centrality measures, including closeness and PageRank centrality), collaborative filtering, and PAM clustering.
[...] are directly connected to each other. In the disease-gene network, a "clique" is a subset in which every disease is directly tied to every gene. "Component" and "clique" are major measures of the cohesion of a network. However, because this study examined disease-gene relationships using the social network measures shown in Table 1 and Figure 1, we used centrality measures, not cohesion measures. Among those measures, we first tested degree for the network. A node's degree is the number of links connecting it to other nodes. Closeness centrality describes how central a point is in the network and is measured by the closeness, or distance, between that point and every other point in the network, as shown in Figure 2 and Table 2. The distance between two points is the length of the shortest path connecting them. The point with the lowest sum of path distances occupies the most central position in the network. According to the results, the network closeness centralization index is 25.774%. Node betweenness centrality appeared similarly in the viewpoint of difference between two countries' clan networks. Betweenness centrality counts the shortest paths between other pairs of nodes that pass through a given node.
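The conversion from the 2-mode network to a 1-mode disease network and the closeness measure can be sketched as below. This is a minimal illustration under stated assumptions: each disease is represented by its vector of gene co-occurrence counts, two diseases are linked when their Pearson correlation exceeds a cutoff (the `threshold` parameter is an assumption, since the text does not state a cutoff), and closeness is computed as the number of reachable nodes divided by the sum of shortest-path distances. All function names are hypothetical.

```python
import math
from collections import deque


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation; treated as 0
    return cov / math.sqrt(var_x * var_y)


def project_diseases(profiles, threshold=0.0):
    """Build 1-mode disease edges from disease -> gene-count vectors."""
    names = sorted(profiles)
    edges = {}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r > threshold:
                edges[(a, b)] = r
    return edges


def closeness(adjacency, source):
    """Reachable nodes divided by the sum of shortest-path distances (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Placeholder gene-count profiles for three diseases (not real data).
profiles = {"d1": [3, 0, 1], "d2": [6, 0, 2], "d3": [0, 4, 0]}
print(project_diseases(profiles))

# Degree and closeness on a small path network a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print({node: len(neighbors) for node, neighbors in path.items()})
print({node: closeness(path, node) for node in path})
```

In the path example the middle node has the smallest sum of distances and therefore the highest closeness, which matches the description of the central position given above.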
This measure describes the connectivity of the node's neighbors. Thus, betweenness centrality yields a higher centrality score for nodes that connect node clusters across the entire network. The measure reflects the degree to which each disease [...] In Table 3, we compared our results with those of the CDC (Centers for Disease Control and Prevention). In our data analysis, many genes are related to cerebral palsy and lung diseases, whereas lung and heart diseases rank highly in the CDC report. Thus, the correlation results among genes and diseases provide important implications about which genes increase the risk of particular diseases. Following the SNA results, this study compared the outcomes with actual diseases to evaluate the clustering results obtained with PAM (Partitioning Around Medoids) clustering [26,27]. Clustering refers to the process of dividing a data set into clusters. Clustering methods fall into two types: partitional and hierarchical. Partitional clustering determines k clusters that optimize a clustering criterion based on Euclidean distance. Specific methods of this type include k-means and k-medoids.

Results
The k-means clustering method is a form of vector quantization that originated in signal processing, and it is popular for cluster analysis in data mining. k-means clustering classifies objects on a set of user-selected characteristics [28,29]. This results in a partitioning of the data space into regions, each containing the points closest to one cluster center. Meanwhile, k-medoids is a clustering algorithm related to k-means and to the medoid shift algorithm [14,30,31].
Both algorithms generate partitions by breaking the data set into groups and minimizing the distance between each point and the point designated as the center of its cluster [30]. In contrast to k-means, k-medoids chooses actual data points as centers and works with an arbitrary matrix of distances between data points [2].
The most common realization of k-medoids clustering is Partitioning Around Medoids (PAM). PAM initializes the clusters by randomly selecting k of the data points as medoids and then associates each data point with its closest medoid. In some situations, k-medoids performs better than k-means, as shown in Figure 3. Figure 3 shows that, asymptotically for large-scale data sets, k-medoids takes less time. In this study, we ran k-medoids with the Symmetrize option (method = "MAX"). The number of medoids (clusters) was 25, the maximum number of swaps was 1000, and proximity was set to similarity. According to the results, the sum of distances to the nearest medoid is 1,301.446, the average distance to the nearest medoid is 2.717, and the maximum distance to the nearest medoid is 5.062.
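The PAM procedure described above can be sketched as follows. This is a minimal illustration, assuming a precomputed dissimilarity matrix, random initial medoids, and a swap phase that applies the best cost-reducing swap until none remains or a swap limit is reached. The function names and the toy data are hypothetical, and the sketch works on distances rather than on the similarity proximities used in this study.

```python
import random


def total_cost(dist, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k, max_swaps=1000, seed=0):
    """Partitioning Around Medoids on a square dissimilarity matrix."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # random initial medoids
    cost = total_cost(dist, medoids)
    for _ in range(max_swaps):
        best = None
        # Try replacing each medoid with each non-medoid point.
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < cost and (best is None or trial_cost < best[0]):
                    best = (trial_cost, trial)
        if best is None:
            break  # no swap lowers the cost, so the search has converged
        cost, medoids = best
    # Associate each point with its closest medoid.
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels, cost


# Toy one-dimensional points forming two obvious groups (not real data).
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam(dist, k=2)
print(sorted(medoids), labels, cost)
```

The returned `cost` corresponds to the "sum of distances to the nearest medoid" reported above; dividing it by the number of points gives the average distance.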
To test the relationships in the disease-gene network shown in Figure 4, this study used collaborative filtering. The number of features (rank) was 10, and the number of items to recommend was 10. The algorithm stops when $|e_t - e_{t+1}|$ is smaller than the convergence tolerance, where $e_t$ is the training error at iteration $t$. The proportion of the validation set was 10.0%. A validation set is a portion of the data set used to evaluate the predictive performance of a model that has been fitted on a separate portion of the same data set (the training set), as shown in Table 4 and Figure 5. Both the training and validation sets are randomly selected, and the proportion of the validation set is configurable.
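The validation split and the stopping rule can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the tolerance value is an assumption because the text does not report one, and the 10% proportion is taken from the setting described above.

```python
import random


def split_observed(observed, validation_fraction=0.10, seed=0):
    """Randomly split observed entries into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(observed)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def has_converged(previous_error, current_error, tolerance=1e-4):
    """Stop when the training error changes by less than the tolerance."""
    return abs(previous_error - current_error) < tolerance


# Placeholder entries standing in for observed disease-gene pairs.
entries = list(range(100))
train, validation = split_observed(entries)
print(len(train), len(validation))
print(has_converged(0.50000, 0.50001), has_converged(0.5, 0.4))
```

In a training loop, `has_converged` would be checked after each iteration with the errors of the last two iterations, and the held-out entries would be scored only after fitting on the training portion.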

Conclusion
The purpose of this study was to analyze and understand the relationships in disease-gene networks using various mining methods applied to biomedical research articles. To this end, we collected 82,538 abstracts of research papers from PubMed by querying the term tobacco. After cleansing the data set, we extracted 479 disease nodes and 869 gene nodes. We also extracted 2195 links under the rule that a disease and a gene appear in the same abstract. Based on this information, we conducted social network analysis, clustering, and collaborative filtering. Using degree scores weighted by how often the terms appeared together in previous papers, we ranked the diseases and compared them with the CDC report. Using Pearson's coefficient, we also examined the gene and disease network with closeness, betweenness, and PageRank centrality. Finally, we evaluated how each gene was clustered with diseases and recommended new genes related to diseases, excluding the disease-gene relationships already present in the data.
This study focused on the data set and its analysis with SNA and collaborative filtering. Future work needs to obtain more detailed knowledge by evaluating the human genes with domain experts. In addition, ego-network analysis of tobacco-related diseases still needs to be tested to see which gene factors affect the onset of a disease in its first stage.

Disclosure
This work was presented at ISBSS Conference 2014.