A Recommendation System Based on Regression Model of Three-Tier Network Architecture

The sparsity of the user-item matrix is a major obstacle to improving the accuracy of traditional collaborative filtering systems, and it is also responsible for the cold-start problem in collaborative filtering approaches. In this paper, a three-tier network architecture, which includes a user relationship network, an item similarity network, and a user-item relationship network, is constructed using comprehensive data from the user-item matrix and the social networks. Based on this framework, a Regression Model Recommendation Approach (RMRA) is established to calculate the correlation score between the test user and the test item. The correlation score is used to predict the test user's preference for the test item. The RMRA mines the latent information in both the social networks and the user-item matrix to improve recommendation accuracy and ease the cold-start problem. We conduct experiments on the real KDD Cup 2012 data set. The results indicate that our algorithm outperforms the traditional collaborative filtering algorithm.


Introduction
A variety of recommendation systems can help people find useful information in big data sets. Among these systems, the collaborative filtering algorithm (hereinafter referred to as CFA) is the most popular recommendation algorithm because it is easy to implement and scales well [1]. All such algorithms face the following problems: (1) the data sparsity problem, in which the user-item matrix is highly sparse in most cases, making the user similarity calculated from this matrix inaccurate; (2) the cold-start problem, in which a new user has not specified enough of his or her product preferences for the system to make effective predictions. Calculating the neighbors of a new user fails for lack of rating records, so a new user cannot get effective recommendations. To address these issues in CFA, Wang et al. put forward a collaborative filtering algorithm based on both users and items, which reduces the prediction difficulty and inaccuracy that stem from the data sparsity problem [2].
In their early stage, recommendation systems were based on the assumption that users are independent and identically distributed. However, people usually ask friends for advice, and friends' suggestions often play an important role in an individual's final decision. With the rapid development of social network services (hereinafter referred to as SNS), such as Facebook (http://www.facebook.com/), Twitter (http://www.twitter.com/), and Tencent (http://www.qq.com/), a good social platform for exchanging information among people has become available. User relationships, user attributes, and item attributes in SNS provide additional information for recommendation systems. Recently, using social network information to improve the performance of recommendation systems has attracted the interest of many scholars [3][4][5][6].
The main contributions of this paper are (1) improving the accuracy of user similarity using user friendships and user attributes in the SNS; (2) improving the accuracy of item similarity using item attributes in the SNS; (3) proposing and constructing a three-tier network architecture, which includes a user relationship network, an item similarity network, and a user-item relationship network; (4) establishing a Regression Model Recommendation Approach (hereinafter referred to as RMRA) to calculate the correlation score between the test user and the test item. The correlation score is used to predict the test user's preference for the test item. The RMRA mines the latent information in the SNS and the user-item matrix, which can improve recommendation accuracy and ease the cold-start problem. Experiments are conducted on the real KDD Cup 2012 data set, in which the RMRA outperforms the traditional CFA.
International Journal of Distributed Sensor Networks

Related Work
2.1. Prerequisites. In a recommendation system, the user set and the item set are the two basic elements. Let the user set be $U = \{u_1, u_2, \ldots, u_m\}$, where $m$ is the number of users, and let the item set be $I = \{i_1, i_2, \ldots, i_n\}$, where $n$ is the number of items.
There are different kinds of relationships between users, such as topological neighbor relations in the SNS, user similarity based on user tags in SNS (user tags are words that describe the user's properties), and user similarity based on the user-item matrix (the user-based CFA predicts from this relationship). These relationships among users constitute an integrated user relationship network (denoted by $G_U$). Meanwhile, there are a variety of relationships between items, including item similarity based on the user-item matrix (the item-based CFA predicts from this relationship) and item similarity based on item tags in SNS (item tags are words that describe the item's properties). These relationships among items compose an integrated item relationship network (denoted by $G_I$). The ratings that users give to items in the user-item matrix form a directed bipartite graph (denoted by $G_{U-I}$). These three networks are defined as follows.
Definition 1. The user relationship network $G_U$ is a weighted directed graph whose nodes are users and whose weighted edges represent the relationship between pairs of users.
Definition 2. The item relationship network $G_I$ is a weighted undirected graph whose nodes are items and whose weighted edges represent the similarity between pairs of items.
Definition 3. The user-item relationship network $G_{U-I} = (U, I, E(U-I))$ is a directed bipartite graph whose edge set $E(U-I)$ is composed of the ratings that users give to items.
A three-tier network architecture is constructed by integrating the networks $G_U$, $G_I$, and $G_{U-I}$, as shown in Figure 1.

Construction of an Integrated User Relationship Network.
Many commercial websites, including Amazon, Taobao, and JD, provide users with personalized recommendations, that is, the site offers each customer a list of commodities he or she may wish to buy. Most of these sites generate recommendations using CFA [7], since CFA scales well and is easy to implement. CFA recommendation systems use two strategies: one is user-based [8,9], and the other is item-based [10,11]. The former relies on the top N most similar neighbors of a user, calculated from the user-item matrix. If nodes represent users and weighted edges represent the similarity between users, then the CFA constitutes a user relationship network as in Definition 1. The user similarity calculated by the CFA is sometimes inaccurate, since the CFA calculates the similarity between users using only the user-item matrix, and that matrix is very sparse. To improve the accuracy of the similarity between users, this paper proposes a comprehensive relationship network that combines the user-item matrix, the trust relationships between users, and the similarity between users' tags in SNS.

Calculate the Level of Trust between Users in SNS.
Traditional CFA is based on the assumption that users are independent and identically distributed, which ignores the trust relationships between users and does not match the real-life phenomenon that people often ask friends for their opinions. To improve recommendation accuracy, some recommendation systems, known as social recommendation systems, take the information in the SNS into account [12][13][14]. Users may trust other users in SNS, and the level of trust between users is a good predictor of user preferences. Trust relationships in SNS (e.g., the follower and followee in Twitter or Tencent) form a directed graph as in Definition 1, where a weighted directed edge $\langle u, v \rangle$ represents the fact that user $u$ follows user $v$. The term "user $u$ follows user $v$" means that $u$ trusts $v$ to some extent or that the interests of users $u$ and $v$ are similar. Meanwhile, to a certain extent, user $v$'s preferences can affect user $u$'s decisions. We use the Gaussian kernel to calculate the degree to which user $u$ trusts user $v$ (formula (1)), where $\mathrm{Trust}(u, v)$ indicates the extent to which $u$ trusts $v$ and $d_{uv}$ represents the topological distance between $u$ and $v$ based on "follow" relations in SNS. The Gaussian kernel transforms the topological distance between user $u$ and user $v$ into the trust level of user $u$ toward user $v$, and $\mathrm{Trust}(u, v) \in (0, 1]$.
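The trust computation can be illustrated with a minimal Python sketch. The exact kernel is not recoverable from the text, so the form exp(-d^2 / (2 sigma^2)), the bandwidth `sigma`, the helper names, and the choice to return 0.0 for unreachable users are all assumptions made for illustration.

```python
import math
from collections import deque


def follow_distance(follows, src, dst):
    """Shortest number of 'follow' hops from src to dst (BFS); None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in follows.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def trust(follows, u, v, sigma=1.0):
    """Assumed Gaussian kernel over the follow distance; sigma is a hypothetical bandwidth."""
    d = follow_distance(follows, u, v)
    if d is None:
        return 0.0  # assumption: no follow path means no trust
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


# Hypothetical follow graph: a follows b, b follows c.
follows = {"a": ["b"], "b": ["c"]}
direct = trust(follows, "a", "b")
indirect = trust(follows, "a", "c")
```

With this kernel, trust decays quickly with distance, so a direct followee is trusted more than a followee of a followee.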

Calculate the User Similarity Using User-Item Matrix.
In the user-based CFA, the similarity between the test user and the other users is calculated first; then the test user's preference for the test item is predicted from the preferences of the test user's most similar neighbors for that item. The Pearson correlation coefficient is a frequently used measure of user similarity [15,16], as shown in formula (2):

$$\mathrm{Sim}(u, v) = \frac{\sum_{i \in I} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{ui} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{vi} - \bar{r}_v)^2}} \tag{2}$$

where $\bar{r}_u = (1/|I|) \sum_{i \in I} r_{ui}$ is the average rating value of user $u$, $r_{ui}$ is the rating that user $u$ gives to item $i$, and $I$ is the item set.
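A minimal Python sketch of the Pearson similarity between two users follows. Ratings are stored as hypothetical `item -> rating` dictionaries, and the sums are restricted to co-rated items, which is a common practical variant and an assumption here rather than the paper's stated procedure.

```python
import math


def pearson_user_similarity(ratings_u, ratings_v):
    """Pearson correlation over the items rated by both users (assumed variant)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # not enough overlap to estimate a correlation
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    den = math.sqrt(var_u) * math.sqrt(var_v)
    return num / den if den else 0.0


# Hypothetical ratings: v rates in proportion to u, w rates in reverse order.
u = {"item1": 1, "item2": 2, "item3": 3}
v = {"item1": 2, "item2": 4, "item3": 6}
w = {"item1": 3, "item2": 2, "item3": 1}
sim_uv = pearson_user_similarity(u, v)
sim_uw = pearson_user_similarity(u, w)
```

The sparsity issue discussed in the text shows up directly in this sketch: when two users share fewer than two rated items, no meaningful similarity can be computed.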

Calculate the User Similarity Using User Tags in SNS.
In SNS, there are not only trust relationships between users but also observable user characteristics; for example, users describe themselves with self-introduction keywords that reveal their occupation, interests, and viewpoints. A series of those keywords is called user tags. User tags are the user's self-description and express the user's standpoint freely. Generally speaking, compared with other information (such as information obtained via data mining), user tags capture user information and user demand more accurately. Therefore, the similarity of user tags represents user similarity to some degree. Generally, the user tags take the following form: $A_u = \{a_1, a_2, \ldots, a_k\}$. In this paper, the Jaccard coefficient is used to calculate the similarity of user tags. Denote $A_u$ and $A_v$ as the tag sets of users $u$ and $v$, respectively; then their Jaccard coefficient is

$$J(u, v) = \frac{|A_u \cap A_v|}{|A_u \cup A_v|} \tag{3}$$
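The Jaccard coefficient of two tag sets can be sketched in a few lines of Python. The tag values below are hypothetical, and returning 0.0 when both sets are empty is an assumption, since the coefficient is undefined in that case.

```python
def jaccard(tags_a, tags_b):
    """Size of the intersection divided by the size of the union of two tag sets."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty tag sets are treated as dissimilar
    return len(a & b) / len(union)


# Hypothetical user tags: two shared tags out of four distinct tags.
tags_u = {"music", "travel", "python"}
tags_v = {"python", "travel", "food"}
tag_similarity = jaccard(tags_u, tags_v)
```

The same function applies unchanged to item tags.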

Building of an Integrated User Relationship Network.
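Because any single data source is sparse, the user similarity calculated from one particular data source is sometimes insufficient and inaccurate. Formula (4) combines formulae (1), (2), and (3) and is used to calculate the weights of the edges in Definition 1:

$$W_U(u, v) = \alpha\,\mathrm{Trust}(u, v) + \beta\,\mathrm{Sim}(u, v) + \gamma\,J(u, v) \tag{4}$$

where $\mathrm{Trust}(u, v)$ is the trust level of formula (1), $\mathrm{Sim}(u, v)$ is the user-item matrix similarity of formula (2), $J(u, v)$ is the user tag similarity of formula (3), and $\alpha$, $\beta$, and $\gamma$ represent the proportions of formulae (1), (2), and (3) in (4), respectively. We use formula (4) to establish the user similarity network $G_U$.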
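The weighted combination can be sketched as follows. The function name and argument names are hypothetical, and the default weights (0.1 for trust, 0.6 for matrix similarity, 0.3 for tag similarity) are taken from the best setting reported in the results analysis.

```python
def combined_user_weight(trust_uv, matrix_sim_uv, tag_sim_uv,
                         alpha=0.1, beta=0.6, gamma=0.3):
    """Edge weight as a convex combination of the three user-user measures."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha, beta, and gamma must sum to 1")
    return alpha * trust_uv + beta * matrix_sim_uv + gamma * tag_sim_uv


# Hypothetical inputs: full trust, moderate matrix similarity, no shared tags.
weight = combined_user_weight(1.0, 0.5, 0.0)
```

Because the three weights sum to 1, a source with no signal for a pair of users (for example, no shared tags) lowers the edge weight instead of removing the edge.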

Construction of an Integrated Item Relationship Network.
In item-based CFA, the similarity between the test item and the other items is calculated. If each node represents an item and a weighted undirected edge represents the similarity between a pair of items, then item-based CFA constitutes an item relationship network as in Definition 2. The item similarity calculated by the CFA is inaccurate, since the CFA calculates the similarity between items using only the user-item matrix, and that matrix is very sparse; this is the main factor limiting the accuracy of the method. To improve the accuracy of the similarity between items, a comprehensive relationship network is constructed from the user-item matrix and the similarity between item tags.

Calculate the Item Similarity Using User-Item Matrix.
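Item similarity is the basis of the CFA and is calculated with Pearson's correlation coefficient in this paper:

$$\mathrm{Sim}(i, j) = \frac{\sum_{u \in U} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U} (r_{uj} - \bar{r}_j)^2}} \tag{5}$$

where $\mathrm{Sim}(i, j)$ represents the similarity between items $i$ and $j$, $\bar{r}_i = (1/|U|) \sum_{u \in U} r_{ui}$ is the average rating that item $i$ receives from all users, $r_{ui}$ is the rating that user $u$ gives to item $i$, and $U$ is the set of users.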

Calculate the Item Similarity Using Item Tags in SNS.
In SNS, there are item tags that use a set of keywords to describe the item. The set of keywords is generally written by industry experts and is more accurate than other item information (such as item information obtained by data mining). Thus, the similarity between item tags is a good complement to the item similarity based on the user-item matrix. An item tag set takes the following general form: $B_i = \{b_1, b_2, \ldots, b_k\}$. In this paper, the Jaccard coefficient is used to calculate the similarity of item tags. Denote $B_i$ and $B_j$ as the tag sets of items $i$ and $j$, respectively; then their Jaccard coefficient is

$$J(i, j) = \frac{|B_i \cap B_j|}{|B_i \cup B_j|} \tag{6}$$

Construction of an Integrated Item Similarity Network.
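Formula (7) combines formulae (5) and (6) and is used to calculate the weights of the edges in Definition 2:

$$W_I(i, j) = \lambda\,\mathrm{Sim}(i, j) + \mu\,J(i, j) \tag{7}$$

where $\mathrm{Sim}(i, j)$ is the user-item matrix similarity of formula (5), $J(i, j)$ is the item tag similarity of formula (6), and $\lambda$ and $\mu$ represent the proportions of formulae (5) and (6) in (7), respectively. Formula (7) is used to establish the item similarity network $G_I$.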
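As a rough illustration of how one edge of the item similarity network might be computed, the Python sketch below combines a Pearson correlation over item columns with a Jaccard coefficient over item tags. The dense matrix layout (rows are users, columns are items, missing ratings stored as 0), the function names, and the default weight of 0.2 (the best value reported in the results analysis) are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def item_edge_weight(matrix, item_tags, i, j, lam=0.2):
    """lam * matrix-based similarity + (1 - lam) * tag-based similarity."""
    col_i = [row[i] for row in matrix]
    col_j = [row[j] for row in matrix]
    union = item_tags[i] | item_tags[j]
    tag_sim = len(item_tags[i] & item_tags[j]) / len(union) if union else 0.0
    return lam * pearson(col_i, col_j) + (1.0 - lam) * tag_sim


# Hypothetical data: 4 users x 3 items, plus one tag set per item.
matrix = [[1, 1, 0],
          [0, 0, 1],
          [1, 1, 0],
          [0, 0, 1]]
item_tags = [{"a", "b"}, {"b", "c"}, {"a", "b"}]
w01 = item_edge_weight(matrix, item_tags, 0, 1)
w02 = item_edge_weight(matrix, item_tags, 0, 2)
```

In this toy data, items 0 and 2 are rated by opposite users but share identical tags, so the tag term keeps their edge weight positive; this is the complementary effect the text attributes to item tags.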

Construction of User-Item Relationship Network $G_{U-I}$.
By Definition 3, $G_{U-I} = (U, I, E(U-I))$ is a directed bipartite graph, wherein $U$ is the set of user nodes, $I$ is the set of item nodes, and the edge set $E(U-I)$ is composed of the ratings that users give to items. In this paper, the value of $E_{U-I}(u, i)$ is given by formula (8). Integrating the networks $G_U$, $G_I$, and $G_{U-I}$ yields the three-tier network architecture shown in Figure 1.

The RMRA Based on Three-Tier Network Architecture
The idea of the user-based CFA recommendation system is to determine whether a test user $u$ has a sufficient preference for a test item $i$ (that is, whether or not to recommend item $i$ to user $u$). First, based on the user-item matrix, user $u$'s top N most similar neighbors are obtained; let these neighbors constitute a set $N_u$. Next, according to the preferences of the users in $N_u$, the test user $u$'s preference for the test item $i$ is predicted. In the CFA, the similarity between any two users is calculated from the vectors of the two users' ratings of all items, and this algorithm has achieved good results. In view of this, we have reason to believe that the similarity of any two users can be calculated from the similarity of the items that the two users have rated. This can be expressed by the following user similarity regression model:

$$S_{uv} = c + w_i \sum_{j \in I(v)} S_{ij} \tag{9}$$

where $S_{uv}$ is the similarity score between the test user $u$ and another user $v$, $S_{ij}$ is the similarity score between item $i$ and item $j$, $I(v)$ denotes the set of items rated by user $v$, $c$ is a constant, and $w_i$ is the coefficient of the regression model and represents item $i$'s contribution to users $u$ and $v$. In general, this model assumes that the similarity of two users can be interpreted as a linear combination of the similarities of the items that these two users have rated, as expressed in formula (9). This idea is consistent with the CFA.
In formula (9), the value $\sum_{j \in I(v)} S_{ij}$ is the sum of the similarity scores between item $i$ and the other items rated by user $v$. Denote

$$\Psi_{iv} = \sum_{j \in I(v)} S_{ij} \tag{10}$$

According to formula (10), formula (9) can be rewritten as

$$S_{uv} = c + w_i \Psi_{iv} \tag{11}$$

Formula (11) shows that the similarity between the test user $u$ and another user $v$ can be calculated from the similarity between item $i$ and the items that $v$ has rated.
Let $T_u$ be a vector, defined in (12), that is constituted by the similarities between the test user $u$ and the other users, where $m$ is the number of users:

$$T_u = (S_{uu_1}, S_{uu_2}, \ldots, S_{uu_m}) \tag{12}$$

Let $\Psi_i$ be a vector formed by the tightness between item $i$ and all users, which is

$$\Psi_i = (\Psi_{iu_1}, \Psi_{iu_2}, \ldots, \Psi_{iu_m}) \tag{13}$$

According to formulae (12) and (13), (9) can be extended to the following form:

$$T_u = c + w_i \Psi_i \tag{14}$$

As seen from formula (14), there is a relationship between $T_u$ and $\Psi_i$. Figure 2 shows the relationship between the user $u$ and the item $i$, and Figure 3 shows the correlation between $T_u$ and $\Psi_i$. The correlation between the vectors $T_u$ and $\Psi_i$ indicates the test user $u$'s preference for the test item $i$. In this paper, the degree of correlation between $T_u$ and $\Psi_i$ is calculated by the Pearson correlation coefficient and is known as the correlation score:

$$CC(u, i) = \frac{\mathrm{cov}(T_u, \Psi_i)}{\sigma_{T_u}\,\sigma_{\Psi_i}} \tag{15}$$

where $\mathrm{cov}$ and $\sigma$ are the covariance and standard deviation, respectively. The correlation score $CC$ is a good measure of the correlation between the test user and the test item and indicates the test user $u$'s preference for the test item $i$. In this paper, the top N items with the highest scores are recommended to the test user.
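The correlation score can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' implementation: the nested-dictionary layout of `user_sim` and `item_sim`, the `rated_items` mapping, and all function names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def correlation_score(u, i, users, user_sim, item_sim, rated_items):
    """Correlate the user-similarity vector of u with the item-tightness vector of i."""
    others = [v for v in users if v != u]
    # Similarity between the test user and every other user.
    t_u = [user_sim[u][v] for v in others]
    # For each other user, summed similarity between item i and the items that user rated.
    psi_i = [sum(item_sim[i][j] for j in rated_items[v] if j != i) for v in others]
    return pearson(t_u, psi_i)


# Hypothetical toy data: test user "u", candidate items "x" and "w".
users = ["u", "a", "b", "c"]
user_sim = {"u": {"a": 0.9, "b": 0.5, "c": 0.1}}
item_sim = {"x": {"y": 0.8, "z": 0.1}, "w": {"y": 0.1, "z": 0.8}}
rated_items = {"a": {"y", "z"}, "b": {"y"}, "c": {"z"}}
cc_x = correlation_score("u", "x", users, user_sim, item_sim, rated_items)
cc_w = correlation_score("u", "w", users, user_sim, item_sim, rated_items)
```

In this toy example, item "x" scores higher than item "w" because the users most similar to "u" rated items that are close to "x", so "x" would be ranked first in the recommendation list.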

Algorithm Analysis
The time complexity of the RMRA is mainly determined by formulae (1), (2), (3), (5), and (7). Among them, the time complexity of formulae (2) and (5) is $O(n^2)$, which is the typical time complexity of the CFA, while the time complexity of formulae (1), (3), and (7) is far lower than that of formulae (2) and (5). Therefore, the complexity of the RMRA is $O(n^2)$, which is consistent with that of the CFA.

Data Set.
To verify the validity of the RMRA, we carried out experiments on the KDD CUP 2012 Track 1 data (http://www.kddcup2012.org/). The following is a brief overview of the data. The KDD CUP 2012 Track 1 data set is a real sample of user data provided by Tencent (http://www.qq.com/). The data set includes 2320895 users and 6095 items with a total of 73209277 user ratings, wherein the data files and data format are as follows: the user-item matrix file (rec_log_train.txt) is in a format of ( . Let $r_{ui}$ represent the rating of user $u$ for item $i$; $r_{ui} = 1$ means that user $u$ follows item $i$, and $r_{ui} = 0$ means that user $u$ refuses to follow item $i$. The density of the user-item matrix is calculated as follows:

$$\frac{73209277}{2320895 \times 6095} \approx 0.52\%.$$
Thus, the user-item matrix is very sparse. The experiments were run on a computer with a 2.6 GHz CPU and 8 GB of RAM. Because the original data set exceeds this computer's processing capacity in both space and time, we randomly sampled 10 million records from the original data set as the experimental data set. Following the tenfold cross-validation method, the experimental data set is randomly divided into 10 parts; each part is used in turn as the test set, with the remaining 9 parts as the training set.
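The tenfold split can be sketched in Python as below. The function name, the fixed `seed`, and the use of a plain list of records are assumptions; the paper does not describe how the random division was implemented.

```python
import random


def ten_fold_splits(records, k=10, seed=0):
    """Yield (train, test) pairs; each of the k folds serves once as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [r for g, fold in enumerate(folds) if g != f for r in fold]
        yield train, test


# Hypothetical stand-in for the sampled rating records.
records = list(range(100))
splits = list(ten_fold_splits(records))
```

Each record appears in exactly one test set, so averaging a metric over the ten runs uses every sampled record for evaluation once.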

Evaluation Criteria.
In this experiment, the RMRA's objective is to recommend a list of items that the test user is interested in. To verify the accuracy of the recommended results, average accuracy (Mean Average Precision at $n$, that is, MAP@$n$) is used to evaluate the RMRA [17].
Given a test user $u$ and a recommendation list $L$ of items sorted by the correlation score, AP@$n$ is

$$\mathrm{AP@}n(u) = \frac{1}{m_u} \sum_{k=1}^{n} P(k)$$

where $m_u$ is the total number of items in list $L$ that are followed by user $u$, and $P(k)$ is the precision at the $k$th position of list $L$. Then, MAP@$n$ is the mean of AP@$n$ over the set $U_T$ of test users:

$$\mathrm{MAP@}n = \frac{1}{|U_T|} \sum_{u \in U_T} \mathrm{AP@}n(u)$$
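A minimal Python sketch of AP@n and MAP@n follows. It assumes the common convention that the precision at position k counts only when the item at that position is followed by the user, and that a user with no followed item in the list contributes 0; the function and variable names are hypothetical.

```python
def average_precision_at_n(ranked_items, followed, n=3):
    """AP@n: sum of precision at each hit within the top n, over followed items in the list."""
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked_items[:n], start=1):
        if item in followed:
            hits += 1
            total += hits / k  # precision at position k, counted only on a hit (assumption)
    followed_in_list = sum(1 for item in ranked_items if item in followed)
    return total / followed_in_list if followed_in_list else 0.0


def mean_average_precision_at_n(recommendations, followed_by_user, n=3):
    """MAP@n: mean of AP@n over all users that received a recommendation list."""
    if not recommendations:
        return 0.0
    scores = [average_precision_at_n(items, followed_by_user.get(user, set()), n)
              for user, items in recommendations.items()]
    return sum(scores) / len(scores)


# Hypothetical lists: u1 follows the 1st, 3rd, and 4th of 5 items; u2 follows the 2nd of 3.
recommendations = {"u1": ["a", "b", "c", "d", "e"], "u2": ["x", "y", "z"]}
followed_by_user = {"u1": {"a", "c", "d"}, "u2": {"y"}}
ap_u1 = average_precision_at_n(recommendations["u1"], followed_by_user["u1"])
map3 = mean_average_precision_at_n(recommendations, followed_by_user)
```

For user u1, the hits at positions 1 and 3 contribute 1/1 and 2/3, and dividing by the three followed items in the list gives the AP@3 value.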

Experimental Procedure
Step 1. Calculate user similarity based on the user-item matrix, user tag similarity, and the topological distance of user relationships; then generate the user relationship network $G_U$.
Step 2. Calculate item similarity based on the user-item matrix and item tag similarity; then generate the item relationship network $G_I$.
Step 3. Generate the user-item relationship network $G_{U-I}$ according to the user-item matrix.
Step 6. Run the experiment on all 10 data sets. The average of the MAP@3 values over these experiments is the final MAP@3.
Step 7. Calculate the MAP@3 of the user-based CFA and the item-based CFA. Then compare them with the RMRA.

Analysis of Results.
To achieve the best average accuracy, the model parameters were tuned. First, for the parameters $\lambda$ and $\mu$ in formula (7) of the item similarity network, since the two coefficients satisfy $\lambda + \mu = 1$, we only need to tune $\lambda$, the weight of the user-item matrix similarity, increasing its value from 0 to 1 in steps of 0.1. We then draw the MAP curve in Figure 4 to illustrate the influence of the parameter $\lambda$ on average accuracy.
In Figure 4, $\lambda = 0$ means that only item tags are used, and $\lambda = 1$ means that only the user-item matrix is used in calculating item similarity in the item similarity network. Figure 4 shows that the recommendation is less accurate when only item tag similarity or only matrix-based item similarity is considered. As $\lambda$ increases from 0 to 1, the average accuracy first increases and then decreases, reaching its maximum at $\lambda = 0.2$. The parameter is thus a tradeoff between item tag similarity and user-item matrix similarity. Item tag similarity alleviates the data sparsity problem of the user-item matrix, and the tradeoff at $\lambda = 0.2$ gives the highest average accuracy.
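The one-dimensional parameter sweep described above can be sketched in Python. The callable `evaluate_map` is a hypothetical stand-in for a full train-and-evaluate run that returns MAP@3 for a given weight; the toy scoring function below only mimics a curve that peaks at 0.2.

```python
def sweep_weight(evaluate_map, step=0.1):
    """Evaluate a weight on the grid 0, step, ..., 1 and return the best point and all results."""
    steps = round(1.0 / step)
    results = []
    for s in range(steps + 1):
        weight = round(s * step, 10)  # rounding avoids drift such as 0.30000000000000004
        results.append((weight, evaluate_map(weight)))
    best_weight, best_score = max(results, key=lambda pair: pair[1])
    return best_weight, best_score, results


def toy_map(weight):
    """Hypothetical stand-in for MAP@3 as a function of the weight."""
    return 0.45 - (weight - 0.2) ** 2


best_weight, best_score, curve = sweep_weight(toy_map)
```

The returned `curve` holds the eleven (weight, score) points that would be plotted as a MAP curve, and `best_weight` is the grid point with the highest score.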
After the parameters of the item similarity network (formula (7)) have been determined, the parameters of the user similarity network (formula (4)) need to be tuned. Because $\alpha + \beta + \gamma = 1$, we only need to determine two of the parameters. First, we fix the trust weight $\alpha = 0.1$ and increase the user-item matrix weight $\beta$ from 0 to 0.9 in steps of 0.1. We then draw the MAP curve in Figure 5 to illustrate the influence of the parameter $\beta$ on average accuracy.
In Figure 5, $\beta = 0$ means that only user tags and user SNS relationships are used, and $\beta = 0.9$ means that only the user-item matrix and user SNS relationships are used in calculating user similarity in the user similarity network. Figure 5 shows that, as $\beta$ increases, the average accuracy first increases and then decreases, with the highest values at $\beta = 0.5$ and $\beta = 0.6$. Thus, we fix $\beta$ at 0.5 and 0.6, respectively, then vary the value of $\alpha$, and obtain the curves shown in Figure 6, which show the impact of the parameters on the average accuracy.
Figure 6 shows that the peaks of average accuracy in both cases lie between $\alpha = 0.0$ and $\alpha = 0.2$. In this interval, the curve for $\beta = 0.6$ is generally higher than the curve for $\beta = 0.5$. From this, we know that the average accuracy achieves its maximum value when $\beta = 0.6$, $\alpha = 0.1$, and $\gamma = 0.3$, where $\gamma$ is the user tag weight. We obtain the MAP@3 results of Step 7 and compare the user-based CFA, the item-based CFA, and the RMRA in Table 1.
Table 1: Comparison of these models.

Model    The user-based CFA    The item-based CFA    RMRA
MAP@3    0.385                 0.327                 0.449

Table 1 shows that, after combining the three networks, the RMRA improves MAP@3 to 0.449, compared with 0.385 for the user-based CFA and 0.327 for the item-based CFA. User tag similarity and user SNS relationships alleviate part of the data sparsity and cold-start problems.

Conclusion
In this paper, a three-tier network architecture, which includes a user relationship network, an item similarity network, and a user-item relationship network, is first constructed using comprehensive data from the user-item matrix and the social networks. Then, based on these networks, the RMRA is established to calculate the correlation score between the test user and the test item. The correlation score is used to predict the test user's preference for the test item. The experimental results indicate that our algorithm outperforms the traditional CFA.
In our future work, we will focus on improving the performance of the recommendation system using a variety of information combination methods. Since the raw data sets processed by a recommendation system are very large, parallel algorithms for recommendation systems will also be part of our future research.