HackerRank: Identifying key hackers in underground forums

With the rapid development of the Internet, the cybersecurity situation is becoming more and more complex. At present, the surface web and dark web contain numerous underground forums or markets, which play an important role in the cybercrime ecosystem. Therefore, cybersecurity researchers usually focus on hacker-centered research on cybercrime, trying to find key hackers and extract credible cyber threat intelligence from them. The data scale of underground forums is tremendous, and key hackers represent only a small fraction of underground forum users. Manually analyzing key hackers takes a lot of time as well as expertise. Therefore, a method or tool is needed to automatically analyze underground forums and identify the key hackers involved. In this work, we present HackerRank, an automatic method for identifying key hackers. HackerRank combines the advantages of content analysis and social network analysis. First, comprehensive evaluations and topic preferences are extracted separately using content analysis. Then, an improved Topic-specific PageRank combines the results of content analysis with social network analysis. Finally, HackerRank produces a ranking of users, with higher-ranked users being considered key hackers. To demonstrate the validity of the proposed method, we applied HackerRank to five different underground forums separately. Compared to using social network analysis and content analysis alone, HackerRank increases the coverage rate on the five underground forums by 3.14% and 16.19% on average, respectively. In addition, we performed a manual analysis of the identified key hackers. The results show that the method is effective in identifying key hackers in underground forums.


Introduction
In the current cybersecurity situation, it is increasingly difficult to guard against advanced attacks or exploits. Hackers have substantial funds, superb technology, and rich experience. They not only improve their attack techniques but are also good at finding weak points in real enterprise networks, including management and personnel. 1 In the face of such a complex attack and defense landscape, one way to deal with the problem is to identify key hackers and then mine emerging cyber threats.
At present, the surface web and dark web contain numerous underground forums or markets, which play an important role in the cybercrime ecosystem. 2 These underground forums are popular places for hackers to learn, exchange information, disclose vulnerabilities, and trade tools, and they also serve as distribution centers for cybercrime. 3,4 Many forums are also dedicated to underground transactions involving malware, stolen information, and other services. 5 Therefore, many cybersecurity researchers focus on hacker-centered research on cybercrime, trying to find key hackers and extract credible cyber threat intelligence from them. 6 The data scale of underground forums is tremendous, and key hackers represent only a small fraction of underground forum users. Identifying key hackers in such a situation is a great challenge. Manually analyzing these key hackers takes a lot of time as well as expertise. Therefore, a method or tool is needed to automate the analysis of underground forums and identify the key hackers involved.
In existing research, two main methods have been used to identify key hackers in underground forums: content-based analysis [7][8][9] and social network-based analysis. [10][11][12] Content-based approaches analyze user data based on selected evaluation metrics, such as activity and content quality. Social network-based approaches build a social network on an underground forum in which key hackers have a high degree of network centrality, with common approaches including degree centrality, eigenvector centrality, and PageRank. In general, content analysis (CA) is relatively comprehensive but complex. Social network analysis (SNA) can directly reflect the posting frequency and relationship of users. It is more objective but ignores users' attribute information.
In this work, we present HackerRank (HR), an automatic method for identifying key hackers. HR combines the advantages of CA and SNA. First, evaluation metrics of underground forum users are computed to generate a comprehensive evaluation. Second, topic analysis of the data generated by users is performed to obtain their topic preferences. Finally, an improved Topic-specific PageRank algorithm is used to fuse the comprehensive evaluation and topic preferences for SNA to obtain a ranking of users, with higher-ranked users being considered as key hackers. To demonstrate the validity of our method, we applied HR to different underground forums separately, comparing it with the method using CA or SNA alone. Besides, we performed a manual analysis of identified key hackers. The results prove that our method is effective in identifying key hackers in underground forums.
The specific contributions of this work are the following: (1) This article proposes a framework for automatically analyzing key hackers in underground forums. HR can automatically collect data from underground forums and analyze key hackers among them. (2) Key hacker identification combines methods based on CA and SNA. This method first extracts the user's comprehensive evaluation metrics and topic preferences based on CA and then applies our improved Topic-specific PageRank for SNA. (3) In order to verify the effectiveness and portability of HR, we conducted experiments on five popular underground forums, and the results showed that the user coverage was higher than when using only CA or SNA.
The rest of this article is organized as follows. Section ''Related work'' presents related work. Section ''Methodology'' details the implementation process of the HR framework. Section ''Experiments'' presents the experiments and analyses. Section ''Conclusion'' summarizes the conclusion and proposes future works.

Related work
We review existing works from two perspectives, including research on underground forums and key hacker identification. Key hacker identification is a branch of research on underground forums.

Research on underground forums
Due to the increasing link between underground forums and cybercrime, researchers have conducted many studies on underground forums. Related research includes the identification of underground forums and the extraction of cyber threat intelligence, hacker assets, and so on. Du et al. 13 proposed a method for systematically identifying and automatically collecting a large number of underground forums, carding shops, Internet Relay Chat (IRC) channels, and Dark Net Marketplaces. Samtani et al. 14,15 analyzed hacking assets within underground forums to identify the tools that may be used in a cyberattack and to provide knowledge on how to implement and use such assets. They developed the AZSecure Hacker Assets Portal, which uses machine learning technology to collect and analyze malicious assets from online hacker communities. Deliu et al. 16 explored the potential of machine learning methods to rapidly sift through underground forums for relevant cyber threat intelligence using text data from real underground forums. Benjamin et al. 17 combined machine learning methods with information retrieval techniques to build an automated method for identifying tangible and verifiable evidence of potential threats within underground forums, IRC channels, and carding shops.

Key hacker identification
Existing methods for identifying key hackers fall into two main categories: content-based and social network-based analysis.
Users of underground forums generate a lot of data, such as created threads, posts, comments, and uploaded attachments. Content-based analysis refers to mining these data [18][19][20] and constructing user evaluation metrics to discover key users among them. Common evaluation metrics include activity level, content quality, and so on. Different studies have chosen different evaluation metrics. For example, Marin et al. 7 analyzed content features, seniority features, and social network features among underground forums. They used an optimization meta-heuristic to identify key hackers and proposed a systematic method based on reputation to validate the results. Fang et al. 8 developed a framework with a set of topic models for extracting popular topics, tracking topic evolution, and identifying key hackers with their specialties. They identified key hackers in each expertise area by utilizing Latent Dirichlet Allocation (LDA), Dynamic Topic Model, and Author Topic Model. Zhang et al. 9 analyzed the knowledge transfer of user posts in underground forums and classified users into four types: expert, casual, learning, and novice hackers. Expert hackers act as key knowledgeable and respectable members in the communities, increasingly acting as knowledge providers. Content-based analysis builds metrics that directly reflect the influence of users by mining user data from underground forums. Although content-based analysis is very comprehensive, it is more complicated, and the selection of evaluation metrics requires professional participation and verification.
In contrast to content-based analysis, social network-based analysis focuses on user interactions in underground forums. [21][22][23] User behavior in underground forums is used to construct a social network graph, which is then used to identify key users using graph-based analysis. 24,25 In general, key hackers have high network centralities, such as degree centrality, eigenvector centrality, and PageRank. Pete et al. 26 utilized network centrality analysis to highlight the structural patterns of each network to identify important nodes and key hackers. Zhang et al. 10 proposed a new heterogeneous information network (HIN) embedding model named ActorHin2Vec to learn the low-dimensional representations for the nodes in HIN, and then a classifier was built for key actor identification. Grisham et al. 11 used a state-of-the-art neural network architecture model to identify mobile malware attachments and then social network-based analysis techniques to determine key hackers disseminating mobile malware. Samtani and Chen 12 analyzed user interactions by leveraging metrics such as network diameter and average path length, and quantified the importance of each user using centrality measures. Social network-based analysis is common across different social platforms but ignores information about the attributes specific to underground forum users. Unlike the works above, we combine the advantages of content-based and social network-based analysis to build a framework for automated analysis of key hackers in underground forums.

Methodology
In this section, we describe HR in detail, a framework for automatically analyzing key hackers in the underground forums. The high-level design of HR is illustrated in Figure 1. Data Collection and Preprocess collects the content of the underground forums and preprocesses the collected data. Social Network Construction generates a social network graph based on the interaction among users. Key Hacker Identification combines analysis based on content and social network. Content-based analysis constructs a comprehensive evaluation based on the user characteristics of underground forums and analyzes the users' topic preferences based on the LDA model. Then, we perform SNA through the improved Topic-specific PageRank algorithm based on the results of CA and generate users' influence. Finally, we get the Top K key hackers from the ranking based on their user influence.

Data collection and preprocess
In this section, we collect the content of underground forums and the interactions among users. In underground forums, discussions are organized as threads (i.e. a user initiates a thread and creates a post, then other users reply to it, discussing various hacker-related information posted by community members). We crawl the forum data in the same way. In other words, we first get all the threads from the forum and then collect all the posts under each thread, including the username, profile, content, order, and time of the post. In addition, we adopt some measures to deal with the anti-crawler mechanisms of the underground forums.
The crawled raw data are not well-formatted. To support better text analysis, we preprocess the data. First, we convert all the data to lowercase to keep the data format consistent. Second, we delete non-ASCII characters and punctuation marks. Finally, we use the natural language toolkit (NLTK) 27 module to segment the text and delete the stop words. Word lemmatization is also applied.
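The preprocessing steps above can be sketched as follows. This is a minimal standard-library illustration, not the pipeline used in the experiments: the stop-word set `STOP_WORDS` and the function name `preprocess` are hypothetical, whitespace splitting stands in for NLTK's tokenizer, and lemmatization is omitted (with NLTK it could be supplied by `WordNetLemmatizer`).

```python
import string

# Tiny illustrative stop-word set (hypothetical); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "this", "for"}

# Translation table that maps every ASCII punctuation mark to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, drop non-ASCII characters and punctuation, tokenize, drop stop words."""
    text = text.lower()
    # Remove every character outside the ASCII range.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace punctuation with spaces so "sql-injection" becomes two tokens.
    text = text.translate(_PUNCT_TO_SPACE)
    # Whitespace tokenization (assumption: stands in for an NLTK tokenizer).
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("This is the NEW exploit for SQL-injection, café!")
```

The order of the steps follows the text: case folding first, then character filtering, then tokenization and stop-word removal.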

Social network construction
SNA studies the relationship between social entities based on graph structure. In a graph, there are two components: nodes and edges. Here, the nodes represent the user of underground forum, and the edges represent the social relationships among users.
The social network graph is displayed in Figure 2. We define the graph as a directed weighted graph $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. Each user in the underground forum represents a vertex $v_i \in V$. If $\langle v_i, v_j \rangle \in E$, there is an interactive relationship between user $v_i$ and user $v_j$. The weight $W$ of the edge is the number of interactions between users. For example, in Figure 2, the edge $\langle v_A, v_D \rangle$ between user A and user D has weight $w_{AD}$, which means that user A has replied to user D's posts $w_{AD}$ times. Note that in underground forums, the thread initiator starts a thread, and other users discuss it in this thread. By default, other users' replies are addressed to the thread initiator, and the connection is established with the thread initiator. However, users sometimes discuss with others directly in the thread. In this case, the connection is established according to the reply target specified by the user.
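As an illustration of the edge rules above, the following sketch builds the directed weighted graph from a list of reply records. The record layout `(author, thread_starter, reply_to)` and the function name `build_reply_graph` are assumptions made for this example; skipping self-replies is also an assumption, since the text does not say how they are treated.

```python
from collections import defaultdict


def build_reply_graph(replies):
    """Build {source: {target: weight}} from (author, thread_starter, reply_to) records.

    reply_to is the explicitly addressed user, or None when the reply has no
    specific target and is therefore attributed to the thread starter.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for author, thread_starter, reply_to in replies:
        target = reply_to if reply_to is not None else thread_starter
        if target == author:
            continue  # assumption: self-replies are not counted as interactions
        graph[author][target] += 1  # edge weight = number of interactions
    return {src: dict(dst) for src, dst in graph.items()}


replies = [
    ("A", "D", None),  # A replies in D's thread: edge A -> D
    ("A", "D", None),  # a second reply raises w_AD to 2
    ("B", "D", "A"),   # B answers A directly inside D's thread: edge B -> A
    ("D", "D", None),  # D posts in its own thread: skipped
]
graph = build_reply_graph(replies)
```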

Key hacker identification
User evaluation metrics construction. To uncover the features and behaviors of key hackers, various works have studied the characteristics of users of underground forums or online forums. As shown in Table 1, we summarize the common features. The related works mainly portray users from three aspects: activity, content quality, and knowledge dissemination ability. Activity is reflected by the number of posts: the more active the user, the more replies and threads the user creates in the forums. Users who publish high-quality content write longer posts that involve a lot of hacker jargon, technical jargon, and threat intelligence. In addition, user interaction usually comes with knowledge transfer (knowledge acquisition and provision), and key hackers are often the core of knowledge transfer.
Based on the previous works, 8,9,18,19,28-31 we construct a user evaluation metric system based on CA and extract features from the collected data as users' evaluation metrics. Entropy evaluates the randomness and disorder of an event, or the degree of dispersion of a metric. The more dispersed the metric, the greater the influence (weight) of the metric on the comprehensive evaluation. Therefore, we adopt the entropy weight method 32 to assign weights to the metrics and generate a comprehensive evaluation for each user. The calculation process is as follows:
Data standardization: as illustrated in equation (1), we use the min-max method to standardize the data, since the measurement units of the indicators are not uniform and the data dimensions and data levels are quite different. In equation (1), $x_{ij}$ represents the $j$th metric of the $i$th user, $\max x_j$ is the maximum value of the $j$th metric, and $\min x_j$ is the minimum value.
Calculate the information entropy of the $j$th metric, where $k = 1/\ln(n)$ and $p_{ij} = x_{ij} / \sum_{i=1}^{n} x_{ij}$.
Calculate the weight of each metric, where $m$ is the count of metrics.
Perform a weighted summation of the metrics to generate a comprehensive evaluation of underground forum users.
LDA-based underground forum topic discovery. In this section, we build a topic discovery model to analyze users' topic preferences. We use the LDA algorithm for topic modeling, which is a three-layer Bayesian probability model containing words, documents, and topics. 33 If a document is considered as a set of word vectors, then the topics of a document follow a multinomial distribution, and the words of a topic also follow a multinomial distribution over the vocabulary. The two multinomial distributions have Dirichlet priors with hyperparameters $\alpha$ and $\beta$.
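To make the entropy weight calculation in section ''User evaluation metrics construction'' concrete, the sketch below follows its four steps. It assumes the standard form of the entropy weight method, in which the entropy of metric $j$ is $E_j = -k \sum_i p_{ij} \ln p_{ij}$ and its weight is $(1 - E_j) / \sum_j (1 - E_j)$; these expressions, the function name `entropy_weight_scores`, and the handling of constant metrics are assumptions of this example rather than a restatement of equations (1) to (4).

```python
import math


def entropy_weight_scores(X):
    """X holds one row per user and one column per metric (at least two users).

    Returns (weights, scores): one weight per metric, one score per user.
    """
    n, m = len(X), len(X[0])
    # Step 1: min-max standardization per metric (a constant metric maps to 0).
    norm = []
    for col in zip(*X):
        lo, hi = min(col), max(col)
        span = hi - lo
        norm.append([(x - lo) / span if span else 0.0 for x in col])
    # Step 2: entropy per metric with k = 1/ln(n) and p_ij = x_ij / sum_i x_ij.
    k = 1.0 / math.log(n)
    entropies = []
    for col in norm:
        total = sum(col)
        if total == 0:
            entropies.append(1.0)  # assumption: a constant metric carries no information
            continue
        e = 0.0
        for x in col:
            p = x / total
            if p > 0:  # convention: 0 * ln(0) = 0
                e -= k * p * math.log(p)
        entropies.append(e)
    # Step 3: weight per metric; lower entropy (more dispersion) gives more weight.
    d = [1.0 - e for e in entropies]
    d_sum = sum(d)
    weights = [di / d_sum for di in d] if d_sum else [1.0 / m] * m
    # Step 4: weighted sum of the standardized metrics for each user.
    scores = [sum(weights[j] * norm[j][i] for j in range(m)) for i in range(n)]
    return weights, scores


# Three users, two metrics (for example, thread count and average reply length).
weights, scores = entropy_weight_scores([[10, 1], [0, 2], [0, 3]])
```

In this toy input the first metric is concentrated in a single user, so it receives the larger weight and that user obtains the highest comprehensive evaluation.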
As for the document, we consider only whether a word appears, not the order of its occurrence. In the LDA model, a document is generated as shown in Figure 3, and the process is as follows:
Sample from the Dirichlet distribution with hyperparameter $\alpha$ to generate the topic distribution $\theta_i$ of document $i$.
Sample from the topic multinomial distribution $\theta_i$ to generate the topic $z_{i,j}$ of the $j$th word of document $i$.
Sample from the Dirichlet distribution with hyperparameter $\beta$ to generate the word distribution $\phi_{z_{i,j}}$ of the topic $z_{i,j}$.
Sample from the word multinomial distribution $\phi_{z_{i,j}}$ to generate the word $w_{i,j}$.
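The generative process above can be simulated in a few lines. The sketch below is only an illustration of how LDA assumes documents arise, not a training procedure; the vocabulary, the symmetric hyperparameter values, and the function names are hypothetical. For simplicity it draws the word distribution of every topic up front, which is equivalent to drawing it when a topic is first used.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, vocab, n_topics, alpha, beta, rng):
    """Generate one bag-of-words document following the LDA generative story."""
    # Word distribution phi_k of each topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    # Topic distribution theta of this document, drawn from a symmetric Dirichlet(alpha).
    theta = sample_dirichlet([alpha] * n_topics, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]  # topic of this word
        w = rng.choices(vocab, weights=phi[z])[0]            # word drawn from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
vocab = ["exploit", "account", "crack", "btc", "tutorial"]
theta, topics, words = generate_document(8, vocab, n_topics=3, alpha=0.5, beta=0.5, rng=rng)
```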
In underground forums, users usually post more than once. In order to understand a user's topic preference, we group all of the user's posts into a document $d$. Through LDA, we can get the probability distribution of words on the topic (equation (5)) and the probability distribution of the article on the topic (equation (6)), where $C_{wk}$ in $p(k \mid w)$ represents the number of times that the word $w$ is assigned to the topic $k$.

Table 1. Content analysis metrics.
Category | Feature | Description
Activity | Start topics [8,28] | Total number of topics created by the hacker
Activity | Start replies [8,28] | Total number of replies created by the hacker
Content quality | Length of topics [19,29] | The average length of the threads created by the hacker (i.e. the number of words contained)
Content quality | Length of replies [19,29] | The average length of the replies created by the hacker (i.e. the number of words contained)
Content quality | Length difference [18] | The ratio of the length of the reply post to the length of the topic post
Content quality | Technical jargon [18] | Count of technical terms included in the post, such as computer and program
Content quality | Hacker jargon [30] | Count of posts including hacker jargon such as attack, penetration, XSS, and SQL injection
Content quality | IOC share [31] | The number of IOCs included in the post, which indicates that hackers may participate in cybercrime or share resources, including IP, hash, domain name, and so on
Knowledge dissemination ability | Replies with knowledge provision [9] | The number of knowledge provision keywords contained in the reply post, such as answers, guide, recommend, and follow
Knowledge dissemination ability | Replies with knowledge acquisition [9] | The number of knowledge acquisition keywords contained in the reply post, such as request, need, and doubt
Knowledge dissemination ability | Topics with knowledge provision [9] | The number of knowledge provision keywords contained in the thread
Knowledge dissemination ability | Topics with knowledge acquisition [9] | The number of knowledge acquisition keywords contained in the thread

To train the LDA model, the selection of the number of topics is essential. At present, perplexity and coherence are often used to determine the number of topics. Perplexity means ''for a document, how uncertain the LDA model is that it belongs to a given topic.'' The more topics, the lower the perplexity, 33 but the more likely the model is to over-fit.
So, once the approximate range of the number of topics is known from the perplexity, coherence 34 can be used to select a more suitable number of topics from this range. In the calculation of perplexity, $M$ represents the count of documents in the text set, $N_d$ is the length of document $d$, and $p(w_d)$ represents the probability of the text. In the calculation of coherence, $D(w_j)$ is the document frequency of word $w_j$, and $D(w_i, w_j)$ represents the co-document frequency of words $w_i$ and $w_j$. 35 We choose the best number of topics to train the LDA model through a comprehensive assessment of coherence and perplexity.
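As a hedged illustration of the two indicators, the sketch below assumes their commonly used definitions: perplexity as $\exp(-\sum_{d=1}^{M} \log p(w_d) / \sum_{d=1}^{M} N_d)$, and coherence as the sum of $\log((D(w_i, w_j) + 1) / D(w_j))$ over pairs of a topic's top words (the UMass form). Both expressions match the symbols defined above but are assumptions of this example, as are the function names; every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations


def perplexity(log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over the documents of a text set."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))


def umass_coherence(top_words, documents):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over pairs with w_j ranked before w_i."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


docs = [["a", "b"], ["a"], ["a", "c"], ["b", "c"]]
coherence_ab = umass_coherence(["a", "b"], docs)
# One two-word document whose words each have probability 0.25 under the model.
ppl = perplexity([2 * math.log(0.25)], [2])
```

A model that assigns probability 0.25 to every word is as uncertain as a uniform choice among four words, so its perplexity is 4; lower values indicate a better fit.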
SNA based on improved Topic-specific PageRank. In sections ''User evaluation metrics construction'' and ''LDA-based underground forum topic discovery,'' we construct user comprehensive evaluation metrics and topic preferences based on CA. In this section, our algorithm is improved from the Topic-specific PageRank algorithm. 36 In our method, we combine the results of the above CA for SNA. Then, we obtain the final user influence value, the HR value.
According to the social network graph constructed in section ''Social network construction,'' the weight of an edge is the number of interactions between users. Since users differ in influence, we need to consider the asymmetric delivery of each node (user). Here, we define the weight of the edge in the social network graph as equation (9), where $U_j$ is the comprehensive evaluation based on user $j$'s activity, post content quality, and knowledge dissemination ability, and $w_{ij}$ is the interaction frequency between user $i$ and user $j$. Next, we construct a transition matrix; the transition of a user's state (i.e. which user the user will communicate with next) is related to the current state, but not to past states. For user $j$, each user $i$ pointed to by an outgoing link has $M_{ij} = N_{ij} / \sum_k N_{jk}$. Each user has a probability $\alpha$ of communicating with other users next time. The users' rank can then be presented as equation (10).
Based on the LDA topic discovery model mentioned in section ''LDA-based underground forum topic discovery,'' in HR, we first use a series of topics to generate the topic vector $\tilde{v}$ ($\tilde{v}$ records the relationship among all users and topics; each topic maintains a $\tilde{v}$ vector). Let $K_j$ be the user set in a topic $T_j$; then, when calculating the PageRank vector of topic $T_j$, we replace the uniform damping vector $p = [1/n]_{n \times 1}$ with the topic vector $\tilde{v}$.
Figure 3. A document's generation in LDA model.
As mentioned above, we have generated a set of topic-specific rank vectors, which measure the user's influence in each topic. In addition, we follow the approach of Weng et al. 37 to get the overall influence of users and calculate a weight $r_t$ for the ranking under topic $t$. For this purpose, we build a matrix $WT$ with dimension $W \times T$, where $W$ is the word frequency of a topic and $T$ is the count of topics. $WT_{ij}$ represents the number of times that the word $w_i$ is assigned to the topic $t_j$.
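The ranking scheme described in this section can be sketched as a power iteration with a topic-specific teleport vector, followed by a weighted sum over topics. This is a generic Topic-specific PageRank illustration under stated assumptions, not the exact improved algorithm: the edge values are taken as given (however equation (9) combines $w_{ij}$ and $U_j$), the update is assumed to be $r = d \cdot M r + (1 - d) \cdot \tilde{v}$ with damping factor $d$, dangling users are redistributed along the teleport vector, and the topic weights $r_t$ are supplied as inputs. All function and variable names are hypothetical.

```python
def topic_pagerank(edges, users, teleport, damping=0.85, iters=100):
    """Power iteration for r = damping * M r + (1 - damping) * teleport.

    edges: {source: {target: value}} with non-negative edge values; every
    target must appear in users. teleport: {user: probability} summing to 1.
    """
    rank = {u: 1.0 / len(users) for u in users}
    out_sum = {u: sum(edges.get(u, {}).values()) for u in users}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) * teleport.get(u, 0.0) for u in users}
        for src in users:
            if out_sum[src] == 0:
                # Dangling user (assumption): spread its rank along the teleport vector.
                for u in users:
                    nxt[u] += damping * rank[src] * teleport.get(u, 0.0)
                continue
            for dst, value in edges[src].items():
                nxt[dst] += damping * rank[src] * value / out_sum[src]
        rank = nxt
    return rank


def overall_influence(topic_ranks, topic_weights):
    """Weighted sum over topics: sum_t r_t * rank_t(user)."""
    users = next(iter(topic_ranks.values())).keys()
    return {u: sum(topic_weights[t] * topic_ranks[t][u] for t in topic_ranks) for u in users}


users = ["A", "B", "C"]
edges = {"A": {"C": 2.0}, "B": {"C": 1.0}, "C": {"A": 1.0}}
# Hypothetical topic vectors: topic 0 is discussed by A and B, topic 1 by C.
teleports = {0: {"A": 0.5, "B": 0.5}, 1: {"C": 1.0}}
topic_ranks = {t: topic_pagerank(edges, users, v) for t, v in teleports.items()}
overall = overall_influence(topic_ranks, {0: 0.5, 1: 0.5})
```

Because each topic vector sums to 1 and rank mass is conserved in every iteration, each topic-specific rank vector also sums to 1, so the weighted combination stays comparable across users.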
The following formula is used to calculate $r_t$. In summary, the calculation of the user's overall influence, HackerRank, is shown in equation (13).

Experiments

Data sets
In this study, we conduct experiments on five different mainstream underground forums. The crawler is designed and developed according to the data collection methods described in section ''Data collection and preprocess.'' Since each forum has a different structure, we adapt the crawler to each forum. The data set is shown in Table 2. In addition to the data we collected, the ''Nulled'' forum data also contain the data leaked in 2016.

Analysis of LDA experimental results
In the process of key hacker identification, we choose the LDA topic model to extract users' topic preferences. Instead of training an LDA model for each underground forum separately, we use all the data in Table 2 to train a general model suitable for underground forum topic analysis. When training the LDA model, the choice of topic number has a great influence on the model. In this article, coherence and perplexity are the indicators we use to evaluate the performance of the model. In the experiment, the topic number is set to 2-10 (interval 1) and 15-50 (interval 5). Figures 4 and 5 show the curves of coherence and perplexity under different topic numbers; in Figure 5, the change in perplexity when the number of topics ranges from 2 to 10 (step = 1) is shown in the upper right.
In Figure 4, coherence reaches its maximum value when the number of topics is 5, and as the number of topics ranges from 5 to 50, coherence decreases overall. As can be seen in Figure 5, the perplexity shows a downward trend as the number of topics ranges from 2 to 5. When the number of topics goes from 6 to 8, the perplexity increases slightly; when the number of topics ranges from 15 to 50 (interval 5), the perplexity is stable at 670 to 720, and the trend of change is relatively gentle. Although the perplexity is not at its global minimum when the number of topics K = 5, it is already at a local minimum, and when the number of topics increases further, the perplexity changes very little. Combining the results of Figures 4 and 5, we choose the number of topics K = 5.
Using the trained LDA topic model, we extract the five most representative words under each topic. As shown in Table 3, we summarize the topic name and representative words of each topic.

Effect of HR
Comparison with related algorithms. To validate HR, we set up comparison experiments. HR combines CA and SNA, so we compare methods that use CA or SNA alone.
CA: users are ranked according to their comprehensive evaluation in section ''User evaluation metrics construction.'' SNA: users are ranked according to their PageRank value.
In the above methods, the damping factor has a large effect on PageRank and HR, which is a balancing parameter between the effectiveness of the algorithm and the speed of convergence. In the experiment, the damping factor is set to 0.85, which is an empirical value. With a damping factor of 0.85, it can converge to the PageRank vector in about 100 iterations. When the damping factor is close to 1, the number of iterations required will increase abruptly, and the sorting will be unstable.
Kendall correlation. Kendall correlation is used to measure the correlation between two random variables. The value of the Kendall correlation $\tau$ ranges from $-1$ to $1$. Two sequences are exactly the same when $\tau = 1$ and exactly opposite when $\tau = -1$. The greater $\tau$ is, the higher the correlation between the two sequences. In this section, we analyze the correlation between HR and the rank lists generated by CA and SNA through Kendall correlation. We find the same trend in different underground forums. As shown by the Kendall correlations in Table 4, the rank list generated by HR differs from those generated by the other methods. At the same time, it can be observed that the correlation $\tau$ of HR versus SNA is higher than that of HR versus CA. This is because different methods use different characteristics and analysis methods to evaluate user influence.
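For reference, the Kendall correlation between two rank lists can be computed by counting concordant and discordant pairs. The sketch below implements the tau-a variant, which assumes that neither ranking contains ties; the function name and the toy rankings are hypothetical, and the variant used for Table 4 is not stated in the text.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings given as {user: position}, assuming no ties."""
    users = list(rank_a)
    concordant = discordant = 0
    for u, v in combinations(users, 2):
        # The pair is concordant when both rankings order u and v the same way.
        s = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(users) * (len(users) - 1) // 2
    return (concordant - discordant) / n_pairs


hr_rank = {"x": 1, "y": 2, "z": 3}
sna_rank = {"x": 1, "y": 3, "z": 2}
tau = kendall_tau(hr_rank, sna_rank)
```

Here two of the three user pairs are ordered identically by both rankings and one is swapped, so the result is (2 - 1) / 3.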
Coverage analysis. To validate the effectiveness of HR, we evaluate HR using coverage, 38,39 which is commonly used in the field of key user identification, as an evaluation metric. Coverage measures the effectiveness of key user identification from the network topology formed by user interactions, by counting the number of affected users.
This article compares the coverage of the three methods on the top 50 key hackers of each underground forum. To fully validate the performance of HR, the experiments are conducted on five different underground forums. As shown in Figure 6, HR's coverage of the top 50 hackers in all five underground forums is higher than that obtained using SNA or CA alone. Specifically, compared to using SNA and CA alone, HR increases the coverage rate (coverage number / total number of forum users) on the five underground forums by 3.14% and 16.19% on average, respectively, which supports the validity and portability of our method. It can be seen from Figure 6 that the HR coverage curve increases rapidly from 1 to 20, and then the growth rate slows down. The top 20 hackers are connected to most of the users in the forum, which shows that in underground forums, a small fraction of key hackers has high influence. In addition, it can be observed that the effect of using only CA is poor. This is because CA considers only the text features of users and ignores the interaction among users.
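A coverage computation of this kind can be sketched as follows. The exact definition of an affected user is not spelled out above, so this example assumes that the top-K users cover themselves plus every user who shares at least one edge with them, ignoring edge direction; the function names and the toy graph are hypothetical.

```python
def coverage(edges, top_users):
    """Number of distinct users covered by the top users (assumed definition).

    edges: {source: {target: weight}}. A user is covered when it is a top user
    or shares an edge, in either direction, with at least one top user.
    """
    top = set(top_users)
    covered = set(top)
    for src, targets in edges.items():
        for dst in targets:
            if src in top:
                covered.add(dst)
            if dst in top:
                covered.add(src)
    return len(covered)


def coverage_rate(edges, top_users, n_users):
    """Coverage number divided by the total number of forum users."""
    return coverage(edges, top_users) / n_users


# Toy forum with six users: A, B, C, D, E, F.
edges = {"A": {"D": 2}, "B": {"A": 1}, "C": {"D": 1}, "E": {"F": 1}}
curve = [coverage(edges, ["D", "A", "E"][:k]) for k in range(1, 4)]
rate = coverage_rate(edges, ["D", "A"], n_users=6)
```

Evaluating `coverage` for growing prefixes of a rank list yields a curve of the kind plotted in Figure 6, which makes it easy to compare rank lists produced by different methods.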
Key hacker identification results. In this section, we show the top five key hackers for each forum obtained using HR, SNA, and CA, as shown in Table 5. It can be seen that the results obtained by the different methods have some similarities as well as some differences. In order to better verify the effectiveness of HR, we manually checked the above results. Taking the Nulled forum as an example, Table 6 shows the top five key hackers and the results of the three analysis methods. Here, we analyze the top five key hackers. ''Zaida'' hacks into a large number of accounts (such as mailboxes) and sells them publicly in the forum, attracting a large number of buyers. ''Veterun'' often publishes high-quality hacking tutorials in the forum and shares links to related hacking resources. At the same time, he also conducts in-depth technical exchanges with other users in the forum. ''Psych0path'' is engaged in software cracking and private data transactions; this user has completed up to 880 transactions in the forum and has a high reputation. ''K33P0'' is very active under themes such as games (such as CSGO) and digital currencies (such as BTC, ETH, and LTC). ''Nord'' focuses on program cracking, participates in activation key trading, and has released many illegally obtained program keys. It can be seen from the above analysis that key hackers not only have high social network influence but also publish high-quality content with distinctive topic preferences. Therefore, by building on both CA and SNA, HR can identify key hackers more accurately.

Conclusion
In this article, we propose a key hacker identification framework for underground forums, HR. This framework combines CA and SNA. First, we mine the user characteristics of underground forums and construct a comprehensive evaluation. Second, the LDA model is used to predict users' topic preferences. In SNA, user influence is obtained using an improved Topic-specific PageRank algorithm based on comprehensive evaluations and topic preferences. Through user influence ranking, we can identify key hackers in underground forums. In our experiments, we compare HR with methods that use CA or SNA alone. The results prove that HR has a significant advantage in identifying key hackers. At present, HR can identify key hackers based on historical data of underground forums but lacks consideration of forum evolution. Also, HR can only identify key hackers in a single forum. In the future, we will work on building a real-time key hacker identification framework based on dynamic graphs and study the identity linkage across different forums.