Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection

Social networking is used widely by millions of people around the world. It has become the most popular way for people to connect and interact with their friends online. Currently, there are many social networking sites, for instance, Facebook, MySpace, and Twitter, with huge numbers of active users. Consequently, they are also attractive targets for spammers and cheaters who want to steal the personal information of users or advertise their products. Recently, many methods using different techniques have been proposed to detect spam comments on social networks. In this paper, we propose a similarity-based method that combines a fingerprinting technique with a trie-tree data structure and a meet-in-the-middle approach in order to achieve higher accuracy in spam comment detection. Using our proposed approach, we are able to detect around 98% of the spam comments in our dataset.


Introduction
In the last few years, social networking has become an Internet phenomenon. It is now the main way for people to connect and keep in touch with their friends online. The most popular social networking sites such as Facebook, Twitter, and MySpace are consistently among the top 20 most viewed websites on the Internet. Many people spend more and more time enjoying their virtual lives on social networks rather than their real lives. Moreover, the personal information stored and shared on such sites is usually under loose security. Hence, social networking is also a potential target for spammers and cheaters who want to advertise their products or, more dangerously, steal users' information. There are many simple tricks, for example, posting fake updates that contain malicious links, abusing the comment function to post unsolicited messages to users, image tricks, and social engineering, with which spammers can achieve their purposes easily.
Spam comments usually have duplicate or near-duplicate contents. Therefore, they can be detected by several common methods that are used to detect duplicate and near-duplicate documents in the web mining field.
Duplicate and mirror web pages abound in the World Wide Web [1]. Besides that, near-duplicate documents are mostly identical to the original ones but differ in several small portions of the document such as advertisements, timestamps, or counters.
Duplicate and near-duplicate document detection is important in various computer science fields, specifically data mining, information retrieval, and web mining. Its advantage is saving storage for necessary data rather than duplicated data. A sizeable percentage of web pages have been found to be near duplicates by several studies [2][3][4]. These studies suggested that approximately 1.7% to 7% of the web pages visited by crawlers are near-duplicate pages. Although duplication due to mirroring and plagiarism is detected simply by applying techniques such as machine learning and document clustering, near-duplicate documents are more difficult to identify.
In this paper, we propose a method using a trie-tree data structure to store a set of 64-bit strings, each of which is the fingerprint of a web document. After that, we use a meet-in-the-middle approach in order to detect near-duplicate documents. For the social spam detection problem, we believe the method described in this paper is capable of identifying spam comments as well as near-duplicate documents with a high level of accuracy.
The rest of the paper is structured as follows: Section 2 reviews related works; Section 3 describes the proposed methods that we utilize to identify duplicate and near-duplicate documents; Section 4 presents our datasets and evaluation. Finally, we offer some concluding thoughts in Section 5.

Related Works
Many researchers were early to realize the difficulty and importance of near-duplicate web page detection in the web mining field. The methods proposed in such previous works are either similarity-based or signature-based [5].
Similarity-based methods usually require that all documents be compared with each other. Specifically, every document is compared to all others in the dataset and the similarity between each pair is calculated [5].
In 1997, Broder et al. [6] analyzed the fraction of web pages that are near-duplicates of others. In that study, they implemented a technique called shingling. The basic idea of their method is to choose a set of contiguous subsequences of every document's tokens. This set is used to represent each document. In their experiment, they compared the representatives of two documents for overlap to provide an approximation of similarity. Higher similarity would be assigned to two documents that have substantial overlap in their tokens.
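As a rough illustration, the shingling idea can be sketched as follows (token-level w-shingles compared with Jaccard resemblance; the function names and the choice of w = 3 are our assumptions, not Broder et al.'s exact formulation):

```python
def shingles(text: str, w: int = 3) -> set:
    """Set of contiguous w-token subsequences (shingles) of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard overlap of the two shingle sets, approximating document similarity."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Identical documents score 1.0, and documents with no shingle in common score 0.0; near-duplicates fall in between.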
Narayana et al. [7] presented a method for near-duplicate detection of web pages in web crawling. After obtaining a new web page from the web crawler, the system extracts the content of that page into many tokens and calculates its similarity score against the various existing documents. A document is considered a near-duplicate web page if its similarity score is greater than a predefined threshold.
With the similarity-based method, the runtime is O(n^2), where n is the number of documents. For that reason, the performance of those methods is usually lower than others when they are applied to massive datasets. Several techniques were proposed to improve the runtime performance. The authors in [8] used documents' sizes in order to decide which documents would be compared to which others.
In another paper, we proposed a method to detect spam SMS on mobile devices and smartphones. That approach was based on improving a graph-based algorithm and utilizing the KNN algorithm, one of the simplest and most effective classification algorithms, in order to improve the accuracy and performance of the detection system on mobile devices [9].
In one example in the fingerprinting context, some frequently occurring shingles are eliminated [5]. In this study, the shingling technique considers a document as a stream of tokens, which is broken into overlapping or nonoverlapping segments referred to as shingles. Instead of using the complete set of tokens from each document to compute the similarity between two of them, they chose a small subset of tokens containing the most frequent ones as the representative of that document. Despite this simplification, the detection accuracy remains sufficient.
Another fingerprinting-based approach to detecting near-duplicate documents was recently proposed by Kumar and Govindarajulu [1]. The fingerprint of a document, which is a 64-bit string, is generated by a fingerprinting algorithm called sim-hash [10]. In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (in this paper, a web document) to a much shorter bit string that is likely to identify the original data. The sim-hash algorithm hashes similar documents to similar fingerprints, and each document can be represented by only 64 bits. Then, K-Means clustering, sentence features, and fingerprint comparison are applied in order to detect near-duplicate web page documents. Pugh, who worked at Google, surmised that two documents can be detected as duplicate or near-duplicate documents if any of their fingerprints match [11].
Following the fingerprinting approach, we propose a new social spam comment detection method that utilizes a trie-tree data structure for fingerprints and a meet-in-the-middle approach in the detection phase, which can be applied to the spam comment detection process. The experimental results show that our method achieves high detection accuracy and efficient performance.

Proposed Method
3.1. Workflow. This section gives an overview of our proposed method. A new document is processed through several steps such as parsing, tokenization, stop words removal, and stemming in the preprocessing phase. After that, the set of most frequent tokens is chosen and used to generate the 64-bit fingerprint by the sim-hash algorithm. Furthermore, from the primary fingerprint of a document, we sequentially invert the values of one or two positions among the 64 bits to generate many 64-bit strings as near-similar fingerprints. Finally, the primary fingerprint and its corresponding near-similar fingerprints are stored in the trie-tree. The overall workflow is illustrated in Figure 1.
In the detection phase, we generate the fingerprints of the new document, including the primary and near-similar ones. Then, we check whether any fingerprint of the new document exists in the trie-tree. If any item exists in the tree, it can be concluded that at least one document in the collection is a duplicate or near-duplicate of the new one.

Preprocessing.
Preprocessing needs to be done prior to fingerprint generation and trie-tree representation. It consists of HTML parsing, tokenization, stop words removal, and stemming [1]. In parsing, a web document is analyzed into a linear representation according to a given grammar. After parsing the HTML, the parsed web content is broken up into many words through the tokenization process, and then they are filtered by removing several connecting words such as "is," "a," and "an" in the stop words removal procedure. Finally, a process called stemming condenses these filtered words into their base forms before passing them to the fingerprint generation process for further processing.

Parsing.
A web page is mixed textual content that includes text, HTML tags, and JavaScript code. Parsing is the procedure of analyzing the document into a linear representation according to a given grammar [12]. It helps to tidy up the HTML tags and JavaScript code to make the content cleaner and more useful before information is extracted from the content of the document.

Tokenization.
This process breaks a text stream up into words, phrases, or other meaningful elements called tokens. After the HTML content is tidied up by the parsing process, the extracted content is tokenized into many words. The number of those words can be reduced without losing the document's meaning by filtering out many common linking words in the process called stop words removal.

Stop Words Removal.
In computing, stop words are words that are filtered out prior to, or after, processing of natural language data (text) [13]. They are usually common words such as "the," "a," "an," and "of," as in Table 1, that appear many times in a document's content. Stop words removal is the process that filters out such words in order to improve the algorithm's performance.

Stemming.
In information retrieval, stemming is the process of reducing inflected words to their stem, base, or root form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Ingason et al. [14] attempt to convert a word to its linguistically correct root, which ultimately facilitates the reduction of all words that possess an identical root to a single one. This is obtained by stripping each word of its derivational and inflectional suffixes [4]. For example, "went," "goes," and "gone" are all condensed to "go," their base form.
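The four preprocessing steps above can be sketched in a few lines of Python. This is a toy version under our own assumptions: a tiny stop-word list, a regex in place of a real HTML parser, and naive suffix stripping in place of a proper stemmer such as Porter's algorithm.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}  # illustrative subset

def strip_tags(html: str) -> str:
    """Crude HTML 'parsing': drop script blocks, then all remaining tags."""
    html = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

def preprocess(html: str) -> list:
    text = strip_tags(html).lower()                      # parsing
    tokens = re.findall(r"[a-z]+", text)                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words removal
    # toy stemming: strip a few common suffixes from longer words
    return [re.sub(r"(ing|ed|es|s)$", "", t) if len(t) > 4 else t for t in tokens]
```

For example, preprocess("<p>The walls are walking</p>") yields the stemmed token list ["wall", "walk"].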

Primary Fingerprint of Document.
A fingerprint of a document is a hash value of its features. It is a 64-bit string generated by the sim-hash algorithm [15], a special hash function.
A hash function is an algorithm that maps data of variable length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums, or simply hashes. Most hash functions generate totally different hash values even for similar inputs.
For similarity detection purposes, sim-hash was proposed. This special hash function, developed by Charikar [15], hashes similar inputs to similar hash values.
Initially, a document is preprocessed to extract a set of keywords (tokens) from its content. We initialize an n-dimensional vector V with each dimension set to zero. Each keyword is tagged with its frequency (the number of times it appears in the containing document). Then each keyword of the document is hashed to an n-bit hash value. Each of the n components of V is increased or decreased by the weight of that word according to the corresponding bit of the hash value. In the final step, the sign of each component determines the corresponding bit of the final fingerprint of the document. The working procedure that applies sim-hash to map a document to a 64-bit fingerprint is illustrated in Figure 2, and the pseudocode of the sim-hash algorithm is given in Algorithm 1 [1].
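A minimal sketch of the sim-hash procedure just described might look as follows. We substitute MD5 (truncated to 64 bits) for the per-token hash; the voting scheme follows the description above, but the helper names and that hash choice are our assumptions, not the paper's implementation.

```python
import hashlib
from collections import Counter

def simhash(tokens: list, bits: int = 64) -> int:
    """Sim-hash: each token votes on every bit position, weighted by its
    frequency; the sign of each accumulated component becomes the
    corresponding bit of the final fingerprint."""
    v = [0] * bits
    for token, weight in Counter(tokens).items():
        # 64-bit per-token hash (first 8 bytes of MD5, an arbitrary choice here)
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp
```

Because most bit votes are shared between similar token multisets, documents with mostly overlapping keywords receive fingerprints that differ in only a few bits.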

Near-Similar Fingerprints Extraction.
A document is represented by a unique fingerprint, and the fingerprints of near-duplicate documents usually differ from each other in only a few bits. To enrich the number of fingerprints, we extract near-similar fingerprints based on the primary one of a document. These fingerprints differ from the primary one in a few bits. The k positions among the 64 bits whose values are inverted in order to extract the near-similar fingerprints are obtained by enumerating the combinations of k positions taken from a set of size 64, whose number is given by the following formula: C(64, k) = 64! / (k! (64 - k)!), where k is the number of different bits between two fingerprints.
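The extraction of near-similar fingerprints can be sketched by enumerating all combinations of bit positions to invert (a straightforward reading of the formula above; the function name is ours):

```python
from itertools import combinations

def near_similar(fp: int, k: int, bits: int = 64) -> set:
    """All fingerprints at Hamming distance 1..k from fp, obtained by
    inverting every combination of up to k of the bit positions."""
    out = set()
    for d in range(1, k + 1):
        for positions in combinations(range(bits), d):
            f = fp
            for p in positions:
                f ^= 1 << p
            out.add(f)
    return out
```

For k = 2 this yields C(64, 1) + C(64, 2) = 64 + 2016 = 2080 near-similar fingerprints per primary fingerprint.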
Boxes 1 and 2 illustrate the running of this algorithm on a 64-bit string.

Near-Duplicate Detection.
A trie-tree data structure and a meet-in-the-middle approach are utilized for near-duplicate detection. Whilst the trie-tree is used to represent all the generated fingerprints of documents in the training phase, the meet-in-the-middle strategy plays an important role in the detection procedure.
A trie-tree is an ordered data structure that is used to store a dynamic set or associative array, in which the keys are usually strings. Furthermore, it allows many strings with similar character prefixes to share the same prefix data and store only their differing tails separately.
First, an empty trie-tree is created. After the fingerprints of each document are extracted, they are inserted into the tree sequentially if they do not already exist there. A complete tree is illustrated in Figure 3.
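A binary trie over 64-bit fingerprints might look as follows. This is a minimal sketch using nested dictionaries as nodes; the class and method names are our assumptions, and a production version would use explicit node objects.

```python
class FingerprintTrie:
    """Binary trie over fixed-length fingerprints; fingerprints sharing a
    bit prefix share the corresponding path of nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, fp: int, bits: int = 64) -> None:
        node = self.root
        for i in range(bits - 1, -1, -1):  # walk from the most significant bit
            node = node.setdefault((fp >> i) & 1, {})

    def contains(self, fp: int, bits: int = 64) -> bool:
        node = self.root
        for i in range(bits - 1, -1, -1):
            b = (fp >> i) & 1
            if b not in node:
                return False
            node = node[b]
        return True
```

Each lookup inspects at most 64 nodes, independent of how many fingerprints are stored, which is what makes the search cost effectively constant.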
After the trie-tree is built in the training phase, it is used in the near-duplicate detection phase. When a newly crawled document arrives, we apply the meet-in-the-middle approach to detect near-duplicate documents. That is, the newly crawled web page is also analyzed, and (C(64, 1) + C(64, 2) + 1) fingerprints are generated from it. Its fingerprints are then sequentially checked against all the fingerprints existing in the trie-tree. If any fingerprint of the new document overlaps with any fingerprint in the tree, we can conclude that the document is a duplicate or near-duplicate. The advantage of the meet-in-the-middle approach is that even with k = 1 we can detect two documents whose fingerprints differ in two bits; similarly, if k is 2, a 4-bit difference between fingerprints can be recognized. Figures 4 and 5 provide an example of the meet-in-the-middle approach. Moreover, a near-duplicate document can also be added to the tree, which is beneficial for enriching the number of existing fingerprints in the tree.
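The meet-in-the-middle effect can be demonstrated with a small self-contained sketch. We use 8-bit toy fingerprints and a plain Python set standing in for the trie; the specific bit patterns are our own illustrative choices.

```python
from itertools import combinations

def expand(fp: int, k: int = 1, bits: int = 8) -> set:
    """fp together with every fingerprint within Hamming distance k of it."""
    out = {fp}
    for d in range(1, k + 1):
        for positions in combinations(range(bits), d):
            f = fp
            for p in positions:
                f ^= 1 << p
            out.add(f)
    return out

stored = 0b10110100
index = expand(stored, k=1)          # training side: primary + 1-bit neighbours

query = stored ^ 0b00000011          # a document whose fingerprint differs in 2 bits
# Detection side: the query is also expanded by 1 bit, so each side covers one
# of the two differing bits and the two expansions meet in the middle.
is_near_duplicate = bool(expand(query, k=1) & index)
```

Here is_near_duplicate is True even though the two fingerprints are 2 bits apart, while a 4-bit-different query would not intersect the index with k = 1 on both sides.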

Dataset and Evaluation
4.1. Data Set. To check the results of our proposed approach, we customize a part of the public dataset that is available at [16]. The whole dataset contains a subset of the WWW pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group [17]. From this dataset, we randomly pick more than 1000 web documents and use these as the training dataset. Here, a subset containing 176 files is chosen and their contents are modified to create the training dataset including both duplicate and near-duplicate documents. Although our method is evaluated on a web page dataset, it can also work well and achieve a good result in spam comment detection. In order to detect social spam, we apply this method to a given spam comment dataset. In other words, we use a spam comment dataset instead of a web page collection to build the trie-tree with the same processes. First of all, we build the trie-tree from the training dataset. Each document is sequentially preprocessed by HTML parsing, tokenization, stop words removal, and stemming before many fingerprints are generated from the set of most frequent tokens. Then, the trie-tree is built from those fingerprints with the structure shown in Figure 3.
From a training dataset with more than 1000 web pages, the trie-tree's size is approximately 22 million nodes, including the root node, when the number of different bits between two fingerprints (k) is 2, whereas the size of the trie-tree is around 960,000 nodes with k = 1. Furthermore, its performance depends on the number of tokens that are used as the representative keywords of each document (m). The chart in Figure 6 illustrates the size of the tree for k values of 1 and 2, while its size is not changed by m. In addition, the graphs in Figures 7 and 8 reveal the computation time in the training and testing phases, respectively.
To evaluate the performance and accuracy of the proposed method, the experiment is done with the training dataset, which includes 176 duplicate and near-duplicate documents. In this experiment, we choose the input data and the constants used in the proposed method as in Table 2.
In the experiments we have chosen k to be 2, which means that the number of different bits is at most 2. Specifically, the set of fingerprints in this case includes not only 2-bit difference fingerprints but also 1-bit difference ones when compared with the primary fingerprint. We can detect almost all of the duplicate documents and achieve high accuracy of near-duplicate detection, as summarized in Table 3.
Moreover, the time performance of our proposed method is definitely better than the previous work. Because they compare each new document with all other documents, although the result is impressive, the time performance is approximately O(n), where n is the number of documents. In our case, because we use the trie-tree data structure to represent the fingerprints, each search takes only O(L), where L is the length of the fingerprint, which is 64, practically a constant.
Near-duplicate detection is a vital issue in data mining. Many existing methods have been proposed to resolve this issue, achieving many different results [18]. We have compared several evaluation metrics of our results to the previous work [1], namely, precision, recall, and F-measure. The values computed in our paper are presented in Figures 9 and 10 with blue bars, whilst those of the previous work are shown with red bars.

Conclusion
The social spam issue is one of the most serious threats in the social networking field. It is obvious that the social and security risks are increasing rapidly because of this issue. Recently, plenty of algorithms based upon many different techniques have been proposed to resolve it. In this paper, we propose an effective method for detecting spam comments on current social networking sites. By applying a trie-tree data structure and a meet-in-the-middle approach, our method can easily detect comments whose contents are near-duplicates of those in a given spam collection. The experimental results have proved the effectiveness of our method in both accuracy and time performance.

Figure 1 :
Figure 1: Near-duplicate document detection in the training phase.

Figure 2 :
Figure 2: Working procedure of the sim-hash algorithm.

Figure 7 :
Figure 7: Time performance of training phase.

Figure 8 :
Figure 8: Time performance of testing phase.
These values are calculated by the following formulas: Precision (P) = No. of true duplicates detected / Total no. of duplicates detected; Recall (R) = No. of true duplicates detected / Total no. of true duplicates in dataset; F-measure (F) = (2 x P x R) / (P + R).
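A brief sketch of how these three metrics would be computed (the function name and the example counts are illustrative, not taken from our experiments):

```python
def evaluation_metrics(true_detected: int, total_detected: int, total_true: int):
    """Precision, recall, and F-measure, exactly as in the formulas above."""
    p = true_detected / total_detected   # fraction of detections that are correct
    r = true_detected / total_true       # fraction of true duplicates that were found
    f = 2 * p * r / (p + r)              # harmonic mean of precision and recall
    return p, r, f
```

For instance, detecting 90 true duplicates out of 100 detections, with 95 true duplicates in the dataset, gives P = 0.9 and R = 90/95.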

Table 1 :
Example of stop words.

Table 2 :
The experimental input.

Table 3 :
The experimental result.