Research article
First published online May 1, 2019

Local linear regression-based unsupervised truth discovery

Abstract

In the era of big data, the data provided by multiple sources for the same entity may contain conflicting information. Because sources with higher reliability degrees provide true information more frequently, the truth can be obtained by estimating the weights or reliability degrees of the sources. Due to hostile websites or faulty sensors, some sources may occasionally provide outliers that deviate significantly from the truth. However, in the majority of existing truth discovery methods, each source is uniformly assigned a weight at the initial stage, so the accuracy of truth estimation is degraded by outliers. Several previous studies have proposed kernel density estimation-based truth discovery algorithms to solve the problems caused by outliers. These approaches aim to estimate the probability distribution of the observation values to assess the reliability of the sources. Unfortunately, data containing outliers that are smoothed by Gaussian kernels may deviate further from the truth. Thus, we propose a local linear regression (LLR) method to address the problems caused by outliers. The proposed method can effectively estimate the source reliability and the truth for datasets with outliers. Experiments on two real-world datasets demonstrate that the proposed method yields more accurate results than existing state-of-the-art methods.

1. Introduction

We are currently living in the age of information explosion, where information on any topic can be conveniently obtained. Multiple sources may provide information on the same entity. For example, the same information on a certain customer can be accessed from multiple databases in a company, and the price of a book may be presented on different websites. However, it is difficult to determine which claims are true [12, 13, 21, 6, 5, 9, 20, 2, 7, 3]. In light of this problem, truth discovery has emerged as a promising technique, and its importance has been recognized by the research community.
An important assumption in most existing truth discovery approaches is that every data source may provide low-quality data, but does so unintentionally rather than maliciously. In real-world application scenarios, certain sources occasionally provide outliers that deviate significantly from the truth due to faulty sensors, transmission errors, hostile websites, or other issues. Because the probability distribution of the claimed values is vital to the accuracy of truth discovery, parameter estimation cannot be reliably performed over datasets with outliers.
The problems caused by outliers were first investigated using nonparametric estimation in [21]. For simplicity, the method based on nonparametric estimation presented in [21] is called kernel density estimation from multiple sources (KDEm). KDEm uses the kernel density function to approximate the probability distribution of the claimed values. Then, the reliability of the sources can be assessed by sampling the curve determined by the probability density function of the claimed values. However, KDEm neglects the fact that boundary points can be produced during the kernel smoothing process if there are outliers in the claimed values [10].
In this paper, we propose a local linear regression (LLR) method to obtain the truths from the claimed values provided by multiple sources in the presence of outliers. LLR uses local weighting to realize kernel regression estimation. With LLR, the estimated probability distribution of values containing outliers (which are processed by the kernel smoothing function) does not change dramatically. Therefore, the processed boundary points deviate only slightly from the truth, which has a minor impact on the estimation of the probability distribution, and the estimated truths can achieve accurate results.
Contributions of this Paper
1. Based on kernel density estimation, the proposed local linear regression method alleviates the effect of the boundary points produced when outliers are processed.
2. We propose an efficient optimization framework to update the source weights and truths, and the convergence rate of the proposed algorithm is faster than that of existing algorithms.
3. We test the proposed method on two real-world applications, and the results demonstrate the advantages of the proposed approach in addressing the problems caused by outliers.

2. Methods

In this section, the LLR model is proposed to address the challenges caused by outliers. In LLR, an iterative procedure is performed over both source weights and truth computations. Additionally, an optimization algorithm is introduced at the end of this section.

2.1 Problem formulation

We first define several basic terms and then introduce the notations used in this paper with an example.
DEFINITION 1. An entity is an object of interest. A claimed value is a piece of information that a source provides for an entity.
DEFINITION 2. A truth is the most trustworthy claimed value of an entity.
DEFINITION 3. A source reliability score describes the possibility of a source providing trustworthy claims. A higher source reliability score indicates that the source is more reliable and can provide more trustworthy information.
Table 1 A weather database

Entity      Source ID   Temperature   Ground truth
Beijing     Source 1    27°C          25°C
Beijing     Source 2    25°C
Beijing     Source 3    32°C
London      Source 1    16°C          18°C
London      Source 2    18°C
New York    Source 1    21°C          26°C
EXAMPLE 1. A weather database is shown in Table 1. In this example, Sources 1, 2, and 3 provide claimed values for the temperature of Beijing. Compared with the ground truths, certain sources provide accurate information, but other sources provide outliers that deviate significantly from the truth. Therefore, outliers may affect the accuracy of the truth estimation.
Our task is to determine the information that is closest to the truth from conflicting information. During the truth discovery procedure, the weight $W_i$ of the $i$-th source can be inferred from its claims. A higher value of $W_i$ implies greater reliability of the information provided by the source. The notations used in this paper are listed in Table 2.
Table 2 Notations

Notation    Definition
x           Set of claimed values
E           Set of entities
S           Set of sources
S_j         The j-th source
E_i         The i-th entity
x_ij        Claimed value provided by the i-th source for the j-th entity
n_j         Number of sources that provide claimed values for the j-th entity
m           Number of entities
W_i         Weight of the i-th source
X_j         X_j = {x_ij}; the set of n_j claims for the j-th entity
V_j         Truth of the j-th entity
V*          V* = {V_j}; the set of m truths
L_ij        Local linear probability density score of x_ij
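As a concrete reading of these notations, the claims of Table 1 could be represented as in the following sketch; this data layout is purely illustrative and is not prescribed by the paper.

```python
# Claimed values x_ij keyed by entity j and source i, following Table 1.
claims = {
    "Beijing":  {1: 27.0, 2: 25.0, 3: 32.0},
    "London":   {1: 16.0, 2: 18.0},
    "New York": {1: 21.0},
}
ground_truth = {"Beijing": 25.0, "London": 18.0, "New York": 26.0}

# n_j: number of sources claiming each entity; S: the set of all sources.
n = {j: len(c) for j, c in claims.items()}
S = sorted({i for c in claims.values() for i in c})
print(n, S)   # {'Beijing': 3, 'London': 2, 'New York': 1} [1, 2, 3]
```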

2.2 Local linear regression model

In real-world applications, the existence of an objective truth cannot be ensured; multiple reliable facts can be obtained from uncertain information. Each uncertain claim can be transformed from a real value into a single component function, and the distribution of these claimed values can then be estimated as a probability distribution. The probability of a claimed value can be regarded as a reflection of the probability of that claim being identified as true. Then, the weights of the sources can be estimated from the probabilities of the claims. Next, we introduce the specific method for estimating the probability distribution.
We assume that a set of claimed values $\{x_1, x_2, \ldots, x_n\}$ is provided by multiple sources for an entity and model these claimed values through probability density estimation. Let $P(x)$ denote a probability density function, and consider the nonparametric estimation of $P(x)$ based on a random sample $\{x_1, x_2, \ldots, x_n\}$ from $P(x)$. Then, the kernel density estimation of $P(x)$ is given by
$$\hat{\Phi}_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_i}\, K\!\left(\frac{x - x_i}{h_i}\right), \tag{1}$$
where $K\!\left(\frac{x - x_i}{h_i}\right)$ is a kernel function with bandwidth $h_i$ ($h_i > 0$) for the $i$-th source [10, 11, 4, 19]. $K(x)$ can be ensured to be a probability density if it satisfies
$$K(x) \geq 0, \qquad \int K(x)\, dx = 1. \tag{2}$$
In this paper, $K(x)$ is determined by a Gaussian kernel as
$$K(x, x^*) = \frac{1}{\sqrt{2\pi}\, h_i}\exp\!\left(-\frac{1}{2}\frac{\|x - x^*\|^2}{h_i^2}\right). \tag{3}$$
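To make Eqs. (1) and (3) concrete, the following is a minimal NumPy sketch of a Gaussian kernel density estimate (not the authors' code); a single shared bandwidth $h$ is assumed for readability, the $1/h$ factor is folded into the kernel, and the function names and example values are illustrative.

```python
import numpy as np

def gaussian_kernel(x, x_star, h):
    """Gaussian kernel of Eq. (3), including the 1/h normalization."""
    return np.exp(-0.5 * (x - x_star) ** 2 / h ** 2) / (np.sqrt(2 * np.pi) * h)

def kde(x, samples, h):
    """Kernel density estimate of Eq. (1) at point x from the claimed values."""
    return np.mean(gaussian_kernel(x, samples, h))

# Example: claimed temperatures for Beijing from Table 1.
claims = np.array([27.0, 25.0, 32.0])
print(kde(26.0, claims, h=2.0))   # higher density near the 25-27 cluster
print(kde(32.0, claims, h=2.0))   # lower density near the outlying claim
```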
However, kernel density estimators are not consistent when estimating a density near the finite end points of the support of the density to be estimated due to the boundary effects that occur in nonparametric curve estimation problems [10].
Next, we define the local linear function and discuss the LLR model in detail. Let $Y(X): \mathbb{R}^p \to \mathbb{R}$ be a continuous function obtained from a set of claimed values $\{x_1, x_2, \ldots, x_n\}$, and let $Y_i$, $i = 1, \ldots, n$, be the probability values of the observations at the points $x_i$. Similar to kernel regression estimation, this estimator can be interpreted as a weighted average of the observations $Y_i$. According to the theory of Nadaraya and Watson [17, 18], the observation is estimated as a locally weighted average using a kernel as the weighting function. The Nadaraya-Watson estimator can be defined as
$$M(x) = \frac{1}{n}\sum_{i=1}^{n} W_i(x)\, Y_i, \tag{4}$$
$$\hat{M}(x) = \hat{\theta} = \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} W_i(x)\,(Y_i - \theta)^2 = \frac{\sum_{i} W_i Y_i}{\sum_{i} W_i}, \quad \text{where } W_i = K\!\left(\frac{x - x_i}{h}\right). \tag{5}$$
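As an illustration of Eqs. (4) and (5), a short sketch of the Nadaraya-Watson estimator follows; it assumes a Gaussian kernel and uses illustrative data, and is not part of the original implementation.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate of Eq. (5): a kernel-weighted average of Y_i."""
    W = np.exp(-0.5 * (x - X) ** 2 / h ** 2)   # Gaussian weights K((x - X_i)/h)
    return np.sum(W * Y) / np.sum(W)

# Example: Y_i are (illustrative) density values observed at the points X_i.
X = np.array([25.0, 27.0, 32.0])
Y = np.array([0.12, 0.11, 0.07])
print(nadaraya_watson(26.0, X, Y, h=2.0))
```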
From Eqs (4) and (5), the estimation of $M(x)$ and the weighted least squares estimator for the local model are equivalent. However, outliers often cause a large disturbance in the weighted averaging of Eq. (4), so the result of Eq. (4) can be impacted by the boundary effect. Because local weighted averaging, rather than global weighted averaging, is used to estimate the probability density value at a point, a local linear regression method is proposed in this paper. Compared with kernel density-based truth discovery, adopting LLR narrows the interval of claimed values that may be affected by outliers.
The principle behind LLR-based truth discovery is as follows. The curve that describes the shape of the probability distribution of the claimed values can be fitted by a sequence of segmented lines. In the 2D space generated by the 2-tuple vector $(x_i, y_i)$, each segmented line can be determined by the function $Y_i = a(x) + b(x)X_i$, $X_i \in (x - h, x + h)$, where $a(x)$ and $b(x)$ are two local parameters. Then, the procedure of local linear estimation can be regarded as the objective function described by Eq. (6). Consider the estimator $\hat{g}(x)$ given by the solution to
$$\begin{cases}\min\limits_{a(x),\, b(x)}\ \dfrac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - a(x) - b(x)X_i\bigr)^2 K_{h_n}(X_i - x)\\ \hat{g}(x) = \hat{a}(x) + \hat{b}(x)X_i\end{cases} \tag{6}$$
Then, the estimate of the function $\hat{g}(x)$ can be obtained by the method of least squares. That is, $\hat{g}(x)$ is the constant term in the weighted least squares regression of $Y_i$ on $(1, X_i - x)$ with weights $K_{h_n}(X_i - x)$:
$$\hat{g}(x, h_n) = e^{T}\bigl(\mathbf{X}_x^{T}\mathbf{W}_x\mathbf{X}_x\bigr)^{-1}\mathbf{X}_x^{T}\mathbf{W}_x\mathbf{Y}, \tag{7}$$
where
$$e = (1, 0)^{T}, \quad \mathbf{X}_x = (X_{x,1}, \ldots, X_{x,n})^{T}, \quad X_{x,i} = (1,\, X_i - x)^{T},$$
$$\mathbf{W}_x = \mathrm{diag}\bigl[K_{h_n}(X_1 - x), \ldots, K_{h_n}(X_n - x)\bigr], \quad \mathbf{Y} = [Y_1, \ldots, Y_n]^{T}.$$
Therefore, the local linear function can be defined as
$$L(x) = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{g}(x)\bigr)^2 K_{h_n}(X_i - x). \tag{8}$$
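A possible implementation of the local linear estimator of Eq. (7) and the local linear score of Eq. (8) is sketched below, assuming a Gaussian kernel; the helper names and example numbers are ours and chosen only for illustration.

```python
import numpy as np

def gaussian_weights(X, x, h):
    """Kernel weights K_h(X_i - x)."""
    return np.exp(-0.5 * (X - x) ** 2 / h ** 2) / (np.sqrt(2 * np.pi) * h)

def local_linear_fit(x, X, Y, h):
    """Local linear estimate g_hat(x): the intercept of the weighted least squares
    regression of Y on (1, X_i - x) with weights K_h(X_i - x), as in Eq. (7)."""
    Xx = np.column_stack([np.ones_like(X), X - x])       # design matrix (1, X_i - x)
    W = np.diag(gaussian_weights(X, x, h))               # W_x = diag[K_h(X_i - x)]
    beta = np.linalg.solve(Xx.T @ W @ Xx, Xx.T @ W @ Y)  # (a_hat(x), b_hat(x))
    return beta[0]                                       # e^T beta = a_hat(x)

def local_linear_score(x, X, Y, h):
    """Local linear score L(x) of Eq. (8): kernel-weighted squared residuals."""
    residuals = (Y - local_linear_fit(x, X, Y, h)) ** 2
    return np.mean(residuals * gaussian_weights(X, x, h))

# Example: density observations Y_i at the claimed values X_i (illustrative numbers).
X = np.array([25.0, 27.0, 32.0])
Y = np.array([0.11, 0.12, 0.07])
print(local_linear_fit(26.0, X, Y, h=2.0), local_linear_score(26.0, X, Y, h=2.0))
```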
From Eq. (8), we can obtain the probability density estimate $L(x)$, which can be considered an estimate of the reliability of a source. The function transformation of $x_i$, $L(x) = \Phi_i(x)$, is a density function with a Gaussian distribution. By considering source trustworthiness, the extended weighted sample mean function is expressed as
$$\hat{L}(x) = \frac{1}{n_i}\sum_{i \in S_i} W_i\, \Phi_i(x_i). \tag{9}$$
The main concept of the proposed model is that sources whose claims receive higher probability density values from LLR provide more trustworthy claimed values. The variance of the error distribution $\|\Phi_i(x_i) - \hat{L}_i\|^2$ can reflect the reliability of the $i$-th source. $\hat{L}(x)$ must be examined in more detail because a higher weight indicates that the values provided by the source are closer to the truths. Therefore, it is important to assign the source weights reasonably and obtain the truth. Next, we introduce the loss function and then determine the rule for updating the source weights.

2.3 Source weight assignment

We must find a set of functions $\hat{L}_1, \ldots, \hat{L}_m$ and a set of numbers $P_1, \ldots, P_n$ that minimize the total loss function
$$F(\hat{L}_1, \ldots, \hat{L}_m; P_1, \ldots, P_n) = \sum_{j=1}^{m}\frac{1}{n_j}\sum_{i \in S_j} P_i\,\bigl\|\Phi_j(x_{ij}) - \hat{L}_j\bigr\|^2, \tag{10}$$
where $n_j$ is the number of claims provided for the $j$-th entity, and $P_1, \ldots, P_n$ must satisfy
$$\sum_{i=1}^{n} m_i \exp(-P_i) = 1, \tag{11}$$
where $m_i$ is the number of claims provided by the $i$-th source.
Minimizing the loss function expressed in Eq. (10) determines the rule for source weight updates, where $P_i$ represents the reliability of the $i$-th source and $\hat{L}(x)$ is given by the weighted kernel density estimate $\hat{L}(x) = \frac{1}{n_i}\sum_{i \in S_i} W_i \Phi_i(x_i)$ of Eq. (9). $\|\Phi_i(x_i) - \hat{L}_i\|^2$ is a measure of the distance between the probability densities $\hat{L}_i$ and $\Phi_i(x_i)$. Therefore, by minimizing the loss function $F$, we can search for the values of the two sets of previously unknown variables $W_i$ and $P_i$, which correspond to the source weights and the source reliabilities, respectively.
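For concreteness, the total loss of Eq. (10) could be evaluated as in the following sketch, assuming that the per-claim density values $\Phi_j(x_{ij})$ and the per-entity estimates $\hat{L}_j$ have already been computed; the data layout is an illustrative assumption.

```python
def total_loss(phi, L_hat, P, sources):
    """Total loss F of Eq. (10).

    phi[j][i]  : density value Phi_j(x_ij) of the claim that source i makes for entity j
    L_hat[j]   : weighted density estimate L_hat_j for entity j (Eq. (9))
    P[i]       : reliability score of source i
    sources[j] : indices of the sources that provide a claim for entity j
    """
    F = 0.0
    for j, S_j in enumerate(sources):
        n_j = len(S_j)
        F += sum(P[i] * (phi[j][i] - L_hat[j]) ** 2 for i in S_j) / n_j
    return F

# Toy example: two entities, three sources (source 2 misses entity 1).
phi = [{0: 0.11, 1: 0.12, 2: 0.07}, {0: 0.16, 1: 0.16}]
L_hat = [0.11, 0.16]
P = [1.0, 1.0, 1.0]
sources = [[0, 1, 2], [0, 1]]
print(total_loss(phi, L_hat, P, sources))
```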
To minimize the total loss function under the constraint in Eq. (11), we use a Lagrange multiplier to solve this optimization problem. The real number $\lambda$ is the Lagrange multiplier. Then, a new optimization function is generated as
$$Q = F(\hat{L}_1, \ldots, \hat{L}_m; P_1, \ldots, P_n) + \lambda\left(\sum_{i=1}^{n} m_i \exp(-P_i) - 1\right). \tag{12}$$
First, we focus on obtaining the set of numbers $P_i$ for $i = 1, \ldots, n$. The solution of the function $Q$ can be converted into a global optimization problem by solving the equations $\frac{\partial Q}{\partial P_i} = 0$ and $\frac{\partial Q}{\partial \lambda} = 0$ and verifying the second-order condition $\frac{\partial^2 Q}{\partial P_i^2} > 0$ for $i = 1, \ldots, n$. Suppose that $\hat{L}_1, \ldots, \hat{L}_m$ are fixed. Then, the optimal solution $P_i$ is
$$P_i = -\log\left(\frac{\sum_{j \in S_i}\frac{1}{n_j}\bigl\|\Phi_j(x_{ij}) - \hat{L}(x)\bigr\|^2}{\sum_{k=1}^{n}\sum_{j \in S_k}\frac{1}{n_j}\bigl\|\Phi_j(x_{kj}) - \hat{L}(x)\bigr\|^2}\right). \tag{13}$$
In addition, truth discovery methods [12, 13, 21, 4, 23, 22, 8, 24, 6, 15] typically assume source consistency, which can be described as follows: a source is likely to provide trustworthy information with the same probability for all objects. This assumption is reasonable in many applications and is one of the most important assumptions for estimating source reliability [15]. Therefore, the source reliabilities $P_1, \ldots, P_n$ can be initialized uniformly. Then, the rule for weight updates can be defined as
$$W_i = \frac{P_i}{\sum_{i \in S_j} P_i}. \tag{14}$$
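A small sketch of the reliability update of Eq. (13) and the weight normalization of Eq. (14) is given below; the per-source distance values are assumed to be precomputed, and the variable names and numbers are illustrative.

```python
import numpy as np

def update_reliability(dist):
    """Update P_i by Eq. (13).

    dist[i] : sum over the entities claimed by source i of
              (1/n_j) * ||Phi_j(x_ij) - L_hat(x)||^2.
    """
    dist = np.asarray(dist, dtype=float)
    return -np.log(dist / dist.sum())

def update_weights(P, S_j):
    """Update W_i for the sources in S_j by Eq. (14): normalize reliabilities to sum to one."""
    P = np.asarray(P, dtype=float)
    return P[S_j] / P[S_j].sum()

# Example: a source whose claims lie far from L_hat gets a small P_i, hence a small weight.
distances = [0.02, 0.03, 0.40]
P = update_reliability(distances)
print(P, update_weights(P, [0, 1, 2]))
```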
The truth calculation procedure is described below based on the source weights obtained from Eq. (14).

2.4 Truth calculation

In general, if the piece of information is from a reliable source, then it is more trustworthy, and the source that provides trustworthy information is more reliable. Therefore, the main concept of truth discovery is that a source with high reliability can provide trustworthy information more frequently. Many truth discovery methods [12, 13, 21, 4, 23, 22, 8, 24, 6, 15, 16] use the weighted averaging strategy to obtain the truths, which is superior to voting or averaging approaches, where the votes from different sources are equally weighted.
Inspired by the method of weighted averaging, we calculate the truth by a weighted average of the claimed values $x_{ij}$, using the source weights $W_i$ and the probability density values $L_{ij}$ obtained from Eq. (8). Therefore, the estimated truth for the $j$-th entity is
$$V_j = \frac{\sum_{i \in S_j} W_i\, L_{ij}\, x_{ij}}{\sum_{i \in S_j} W_i\, L_{ij}}. \tag{15}$$
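For one entity, the weighted-average truth of Eq. (15) can be computed as in the following short sketch; the weights and density scores used in the example are illustrative.

```python
import numpy as np

def estimate_truth(x, W, L):
    """Weighted-average truth of Eq. (15) for one entity:
    x - claimed values x_ij, W - source weights W_i, L - density scores L_ij."""
    x, W, L = map(np.asarray, (x, W, L))
    return np.sum(W * L * x) / np.sum(W * L)

# Beijing claims from Table 1 with illustrative weights and density scores.
print(estimate_truth([27.0, 25.0, 32.0], [0.5, 0.4, 0.1], [0.11, 0.11, 0.07]))
```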

2.5 Algorithm flow

Algorithm 1: LLR Algorithm Flow
1. Initialize $P_1 = \cdots = P_n$, $i = 1, \ldots, n$;
2. Update $\hat{L}(x)$ by $\hat{L}(x) = \frac{1}{n_i}\sum_{i \in S_i} W_i \Phi_i(x_i)$, where $W_i = \frac{P_i}{\sum_{i \in S_j} P_i}$, $i = 1, \ldots, n$;
3. Update $P_i$ by $P_i = -\log\left(\frac{\sum_{j \in S_i}\frac{1}{n_j}\|\Phi_j(x_{ij}) - \hat{L}(x)\|^2}{\sum_{k=1}^{n}\sum_{j \in S_k}\frac{1}{n_j}\|\Phi_j(x_{kj}) - \hat{L}(x)\|^2}\right)$, $i = 1, \ldots, n$;
4. Update the truth by $V_j = \frac{\sum_{i \in S_j} W_i L_{ij} x_{ij}}{\sum_{i \in S_j} W_i L_{ij}}$, $j = 1, \ldots, m$;
5. Repeat steps 2, 3, and 4 until the total loss $F(\hat{L}_1, \ldots, \hat{L}_m; P_1, \ldots, P_n)$ no longer changes.
In the above algorithm, there are two major steps in each iteration:
Step 1:
Update the weights. We first compute the source weights from $W_i = \frac{P_i}{\sum_{i \in S_j} P_i}$, where the reliabilities $P_i$ (initialized as $P_1 = \cdots = P_n$) are updated according to Eq. (13) in each iteration.
Step 2:
Update the truths. Having calculated the weight of each source, the truth of each entity is calculated by the weighted combination of $L_{ij}$ and the claimed values $x_{ij}$, as shown in Eq. (15).
An iterative algorithm is used to achieve more accurate results. The proposed algorithm obtains the probability density estimates from LLR and then iteratively updates the source weights and truths until the algorithm converges. Because applying a block coordinate descent [1] iterative method yields continuous reductions in the total loss function of Eq. (10), the estimated source weight scores can be obtained. Experimental results on the convergence rate are provided in the Results section.
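The sketch below shows one possible way to wire these steps together, following the structure of Algorithm 1. It is a simplification rather than the authors' implementation: a plain Gaussian KDE stands in for the local linear score of Eq. (8), and source distances are measured in value space rather than density space, purely to keep the example short.

```python
import numpy as np

def _kde(x, samples, h):
    """Gaussian kernel density estimate (stand-in for the score of Eq. (8))."""
    return np.mean(np.exp(-0.5 * (x - samples) ** 2 / h ** 2)) / (np.sqrt(2 * np.pi) * h)

def llr_truth_discovery(claims, h=2.0, max_iter=50, tol=1e-6):
    """Simplified sketch of Algorithm 1. claims[j] is {source_id: claimed value}."""
    sources = sorted({i for c in claims.values() for i in c})
    P = {i: 1.0 for i in sources}                 # step 1: uniform reliabilities
    truths, prev_loss = {}, np.inf

    for _ in range(max_iter):
        loss, dist = 0.0, {i: 1e-12 for i in sources}
        for j, c in claims.items():
            ids = list(c)
            x = np.array([c[i] for i in ids])
            L = np.array([_kde(v, x, h) for v in x])      # density score of each claim
            W = np.array([P[i] for i in ids])
            W = W / W.sum()                               # Eq. (14)
            truths[j] = np.sum(W * L * x) / np.sum(W * L) # Eq. (15)
            for i, v in zip(ids, x):
                d = (v - truths[j]) ** 2 / len(ids)       # value-space distance (simplification)
                dist[i] += d
                loss += P[i] * d
        total = sum(dist.values())
        P = {i: -np.log(dist[i] / total) for i in sources}  # Eq. (13)
        if abs(prev_loss - loss) < tol:                   # step 5: loss has stabilized
            break
        prev_loss = loss
    return truths, P

# Example on the weather claims of Table 1.
claims = {"Beijing": {1: 27.0, 2: 25.0, 3: 32.0},
          "London": {1: 16.0, 2: 18.0},
          "New York": {1: 21.0}}
print(llr_truth_discovery(claims))
```

In this toy run, the source claiming 32°C for Beijing accumulates a large distance, receives a small reliability, and is progressively down-weighted, which pulls the estimated truth toward the 25-27°C cluster.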

3. Results

The proposed LLR method was tested on two real-world datasets. The experimental results illustrate that our method outperforms the state-of-the-art truth discovery methods when there are outliers in the data set.

3.1 Experimental setup

Next, we introduce the basic methods and procedures of the experiments and discuss the performance of the experimental results.

3.1.1 Performance measures

For each entity, there is only one ground truth $V_j^*$ and one estimated truth $\hat{V}_j^*$. We use the mean absolute error (MAE) and the root mean squared error (RMSE) to evaluate the performance of the various methods. The MAE and RMSE are defined as
$$\mathrm{MAE} = \frac{1}{m}\sum_{j=1}^{m}\bigl|V_j^* - \hat{V}_j^*\bigr|,$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\bigl(V_j^* - \hat{V}_j^*\bigr)^2}.$$
Here, a smaller MAE or RMSE indicates better performance.
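For reference, the two measures can be computed as in the following sketch; the ground truths come from Table 1 and the estimated values are purely illustrative.

```python
import numpy as np

def mae(truth, estimate):
    """Mean absolute error between ground truths and estimated truths."""
    return np.mean(np.abs(np.asarray(truth) - np.asarray(estimate)))

def rmse(truth, estimate):
    """Root mean squared error between ground truths and estimated truths."""
    return np.sqrt(np.mean((np.asarray(truth) - np.asarray(estimate)) ** 2))

# Ground truths for Beijing/London/New York (Table 1) against illustrative estimates.
print(mae([25, 18, 26], [26.7, 17.0, 21.0]), rmse([25, 18, 26], [26.7, 17.0, 21.0]))
```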

3.1.2 Baseline methods

The following baseline methods are used to resolve the conflicts among the information originating from multiple sources. We first introduce two methods, the mean and median, and then describe the experiments.
Mean – the truth for each entity is the mean of the claims.
Median – the truth for each entity is the median of the claims.
The baseline methods also include effective truth discovery methods: CRH [13], KDEm [21], and CATD [12]. CRH seamlessly integrates data of various types by estimating information trustworthiness. Based on the kernel function, KDEm is an uncertainty-aware approach that determines the source reliability by estimating the probability distribution of the claims and summarizing the trustworthy information. CATD is a truth discovery approach that considers the confidence interval of the source reliability estimation. For LLR, the Gaussian kernel is applied, and $h_i$ is set to the median absolute deviation (MAD) of the data $\{x_1, x_2, \ldots, x_n\}$:
$$h_i = \mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\bigl|x_i - \bar{x}\bigr|.$$
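The bandwidth rule can be sketched as follows; note that the absolute-value bars in the formula above are our restoration, since raw deviations from the mean sum to zero, so the sketch computes the mean of absolute deviations.

```python
import numpy as np

def mad_bandwidth(x):
    """Bandwidth h_i as the mean of absolute deviations from the sample mean
    (assumption: the absolute-value bars lost in extraction are restored)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - x.mean()))

print(mad_bandwidth([27.0, 25.0, 32.0]))   # bandwidth for the Beijing claims in Table 1
```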

3.2 Experimental results

Next, we present the experiments conducted on two real-world datasets to test the effectiveness of our method.

3.2.1 Experiments on real-world datasets

Weather Dataset. Weather information is recognized as a good observation object because there can be significant differences in the information reported for the same area. Specifically, we gathered the temperature forecast information of 88 large US cities from HAMweather, Wunderground, and World Weather Online. The dataset covers more than two months (Oct. 7, 2013 to Dec. 17, 2013). In addition to the forecast information, the actual highest and lowest temperature observations of each day were also collected for evaluation purposes.
Stock Dataset. The stock dataset [14] contains a large number of observations from different stock trading platforms. This observation information is a good test bed because the data are continuous in nature. The stock data, which were downloaded from the website in July 2011, consist of 1,000 stock symbols and 16 properties from 55 sources.
In the above two open datasets, the claimed value about a certain entity provided by a data source can be modeled as the sum of the truth and noise. The concept of truth discovery is proposed to resolve the conflicts among information originating from multiple sources, and the additive noise is the main cause of data conflicts. If a source is unreliable, the noise it introduces generally has a wide spread, which means that the variance of its noise distribution is large. Therefore, the noise distribution can be described by a Gaussian distribution with mean 0 and a certain variance. According to this general mathematical description, an outlier can be represented as the sum of the truth and noise with a large variance; in contrast, missing values can be estimated as the sum of the truth and noise with a small variance. It is known that the median is less sensitive to the existence of outliers, and thus the median value is more desirable for truth computation in noisy environments. Therefore, we use the median value of the claims about a certain entity provided by different sources to estimate the missing values.
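A minimal sketch of this median-based imputation is given below; the use of NaN to mark missing claims is an illustrative convention, not part of the original datasets.

```python
import numpy as np

def impute_missing(claims):
    """Fill missing claims for an entity with the median of the observed claims,
    since the median is robust to outlying claims (as argued above).
    `claims` may contain np.nan for sources that did not report a value."""
    claims = np.asarray(claims, dtype=float)
    median = np.nanmedian(claims)
    return np.where(np.isnan(claims), median, claims)

# Example: the third source did not report a value for this entity.
print(impute_missing([27.0, 25.0, np.nan, 32.0]))   # -> [27. 25. 27. 32.]
```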

3.2.2 Performance comparison

Table 3 Performance comparison

Method    Weather MAE    Weather RMSE    Stock MAE    Stock RMSE
LLR       1.7711         2.6766          133422       249111
Mean      10.0004        13.0922         304431       558596
Median    4.3201         9.4150          276031       504009
KDEm      5.7021         9.2813          307802       550995
CRH       4.1500         9.2168          335365       620346
CATD      11.2930        17.1569         276030       504001
Figure 1. Results of the experiments on Weather Data.
Figure 2. Results of the experiments on Stock Data.
Table 3 provides the experimental results for the weather and stock datasets. Among the baseline methods, the mean and median methods aggregate the collected information to estimate the truth and do not consider the reliability of the sources; the mean and median may therefore deviate from the truth when many outliers are present. The experimental results of each method on the weather dataset are shown in Fig. 1. The MAE and RMSE of the proposed method are 68.94% and 73.52% lower, respectively, than those of the baseline method KDEm. The experimental results of each method on the stock dataset are shown in Fig. 2. The MAE and RMSE of the proposed method are 56.65% and 54.79% lower, respectively, than those of KDEm. KDEm's estimate of the probability distribution of outliers may produce boundary points, and thus its experimental results are worse than those of our method. CRH and CATD cannot precisely estimate the truth because they were originally designed for categorical data and only later extended to numeric claims; they are therefore overly sensitive to outliers.

3.3 Efficiency

Next, we test the efficiency of LLR. We first test the rate of convergence and then provide the running time of the proposed method.

3.3.1 Convergence rate

The convergence speed of the algorithm is also considered when measuring performance. The convergence rate is demonstrated using the stock dataset. As shown in Fig. 3, the weights of sources 2, 3, 4, and 5 increase continuously, whereas the weight of source 1 decreases until convergence. The source weights reach a stable stage by the end of the third iteration. Therefore, our algorithm converges reasonably quickly.
Figure 3. Source weight in each iteration.

3.3.2 Running time

We further test the efficiency of the proposed method on the stock dataset to demonstrate the running time of LLR. We also investigate the relationship between the running time and the number of observations. As Fig. 4 shows, there is a strong linear relationship between the running time and the number of observations. To further confirm this relationship, we calculated Pearson's correlation coefficient, a commonly used metric for examining the linear relationship between variables. The coefficient ranges from -1 to 1, where a value closer to 1 (or -1) indicates a stronger positive (or negative) linear relationship between the variables. The experimental results are shown in Table 4, where Pearson's correlation coefficient between the running time and the number of observations is 0.9985, indicating that they are highly linearly correlated.
Table 4 Running time of LLR

Number of observations    Time (s)
5.5 × 10²                 0.135
1.1 × 10³                 0.254
2.2 × 10³                 0.531
4.4 × 10³                 0.938
8.8 × 10³                 1.686
1.1 × 10⁴                 2.208
Pearson correlation       0.9985
Figure 4. Running time.
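As a quick sanity check on the reported coefficient, Pearson's correlation can be recomputed from the values in Table 4 as in the following sketch.

```python
import numpy as np

# Observations and running times from Table 4.
n_obs = np.array([5.5e2, 1.1e3, 2.2e3, 4.4e3, 8.8e3, 1.1e4])
time_s = np.array([0.135, 0.254, 0.531, 0.938, 1.686, 2.208])

# Pearson's correlation coefficient between the number of observations and running time.
r = np.corrcoef(n_obs, time_s)[0, 1]
print(round(r, 4))   # close to the reported value of 0.9985
```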

4. Related work

The problem of improving data quality has been extensively studied. Several related works resolved the conflicts among information originating from multiple sources. The previous truth discovery methods [12, 13, 21, 23, 22, 8, 24, 6, 15, 16] estimate the truth depending on the source’s reliability and the trustworthiness of the information.
Recently, additional truth discovery methods have been proposed to address practical problems in several scenarios. Truth discovery concepts were first proposed by Yin et al. [23]. Several subsequent studies have aimed to solve specific problems associated with truth discovery. GTM [24] proposed a Bayesian probabilistic model to resolve the crucial problem of truth finding in numerical data. Considering that sources may provide different types of data, CRH [13] aims to resolve conflicts in heterogeneous data. CATD [12] was proposed to address the long-tail phenomenon that occurs when multiple sources provide only a small number of claimed values, by considering the confidence interval of the source reliability estimation.
However, these methods use parametric estimation to estimate the truths [12, 13, 23, 22, 8, 24, 6, 15, 16], where the parameter is estimated in the case of the known sample distribution. Unlike parametric statistics, nonparametric statistics make no assumptions regarding the probability distributions of the variables being assessed. Assuming that the specific distribution of the data samples is not known, nonparametric estimation can be considered a good approach to estimate the truth. However, there may be numerous outliers in the test sample dataset, which cause a large disturbance in the probability distributions of the variables. Furthermore, the outliers impact the boundary points generated by the kernel density estimation function. The LLR method proposed here can alleviate the effects of outliers.

5. Discussions and future work

Truth discovery is motivated by the strong need to resolve conflicts among multi-source noisy information, since conflicts are commonly observed in databases, the Web, crowdsourced data, etc. Wan et al. [21] noticed that the truth can be represented by the uncertainty of trustworthy opinions, proposed an uncertainty-aware approach called Kernel Density Estimation from Multiple Sources (KDEm) to estimate its probability distribution, and summarized trustworthy information based on this distribution. Therefore, kernel density estimation has been recognized as an effective method for inferring the truth, and the weights of the sources can be obtained from the distances between the kernel functions and the distribution of claimed values. The contribution of our paper is that, based on kernel density estimation, the corresponding processing of outliers can alleviate the effect of boundary points using the local linear regression method.
Although various truth discovery methods have been proposed, many issues remain to be tackled. Since most existing truth discovery methods [12, 13, 21, 23, 22, 8, 24, 6, 15, 16] are designed to work in an iterative manner, the outliers provided by unreliable sources are involved in each iteration. Excluding outliers prior to truth discovery will be investigated in our future work.

6. Conclusion

Truth discovery aims to obtain the truths from the trustworthy claims provided by multiple sources. Outlier data may deviate significantly from the truth, and existing works do not accurately estimate the truth when there are outliers in the data. In this paper, the kernel density estimation function is adopted to estimate the probability distribution. However, outliers that are smoothed by the Gaussian kernel may cause a significant change in the probability score. The LLR method, which minimizes the deviation of the overall data valuation, is applied to handle the outliers. In this manner, the effect of the boundary points caused by kernel density estimation can be alleviated. We also derive an optimization algorithm to iteratively update the source weights and truths, which yields accurate results. Experiments on two real-world datasets demonstrate the clear advantages of our method over traditional truth discovery methods.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China [Nos. 81471741, 81471728, 81671770, and 61802327], the Chinese National Natural Science Foundation “Development of Data Sharing Platform of Tibetan Plateau’s Multi-Source Land-Atmosphere System Information” under grant number 91637313, China Special Fund for Meteorological Research in the Public Interest (Major projects) (GYHY201506001-7), the Natural Science Foundation of Hunan Province [No. 2018JJ3511].

References

1. Bertsekas D.P., Nonlinear programming, Athena Scientific, ISBN: 978-1-886529-05-2, 1999.
2. Bleiholder J. and Naumann F., Conflict handling strategies in an integrated information system, in: International World Wide Web Conferences, 2006.
3. Bleiholder J. and Naumann F., Data fusion, ACM Computing Surveys, 2009.
4. Chacón J.E. Mateu-Figueras G. and Martín-Fernández J.A., Gaussian kernels for density estimation with compositional data, Computers & Geosciences 37(5) (2011), 702–711.
5. Dong X.L. Berti-Équille L. and Srivastava D., Integrating conflicting data: the role of source dependence, Very Large Data Bases 2(1) (2009), 550–561.
6. Dong X.L. and Naumann F., Data fusion: resolving data conflicts for integration, Very Large Data Bases 2(2) (2009), 1654–1655.
7. Fan W. Geerts F. Tang N. and Yu W., Conflict resolution with data currency and consistency, Journal of Data and Information Quality 5 (2014), 6.
8. Fan W. Yang S. Perros H.G. and Pei J., A multi-dimensional trust-aware cloud service selection mechanism based on evidential reasoning approach, International Journal of Automation and Computing 12(2) (2015), 208–219.
9. Galland A. Abiteboul S. Marian A. and Senellart P., Corroborating information from disagreeing views, 2010, 131–140.
10. Karunamuni R.J. and Alberts T., On boundary correction in kernel density estimation, Statistical Methodology 2(3) (2005), 191–212.
11. Kim J. and Scott C., Robust kernel density estimation, Journal of Machine Learning Research 13(1) (2012), 2529–2565.
12. Li Q. Li Y. Gao J. Su L. Zhao B. Demirbas M. Fan W. and Han J., A confidence-aware approach for truth discovery on long-tail data, Very Large Data Bases 8(4) (2014), 425–436.
13. Li Q. Li Y. Gao J. Zhao B. Fan W. and Han J., Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation, in: International Conference on Management of Data, 2014, pp. 1187–1198.
14. Li X. Dong X.L. Lyons K. Meng W. and Srivastava D., Truth finding on the deep web: is the problem solved? Very Large Data Bases 6(2) (2012), 97–108.
15. Li Y. Gao J. Meng C. Li Q. Su L. Zhao B. Fan W. and Han J., A survey on truth discovery, Sigkdd Explorations 17(2) (2016), 1–16.
16. Li Y. Li Q. Gao J. Su L. Zhao B. Fan W. and Han J., On the discovery of evolving truth, Knowledge Discovery and Data Mining 2015 (2015), 675–684.
17. Loukas S. and Nadaraya E.A., Nonparametric estimation of probability densities and regression curves, Journal of The Royal Statistical Society Series A-statistics in Society 154(1) (1991), 184.
18. Nadaraya E.A., On estimating regression, Theory of Probability and Its Applications 9(1) (1964), 141–142.
19. Parzen E., On estimation of a probability density function and mode, Annals of Mathematical Statistics 33(3) (1962), 1065–1076.
20. Schubert E. Koos A. Emrich T. Züfle A. Schmid K.A. and Zimek A., A framework for clustering uncertain data, Very Large Data Bases 8(12) (2015), 1976–1979.
21. Wan M. Chen X. Kaplan L.M. Han J. Gao J. and Zhao B., From truth discovery to trustworthy opinion discovery: An uncertainty-aware quantitative modeling approach, 2016, 1885–1894.
22. Wang D. Kaplan L.M. Le H.K. and Abdelzaher T.F., On truth discovery in social sensing: a maximum likelihood estimation approach, 2012, 233–244.
23. Yin X. Han J. and Yu P.S., Truth discovery with multiple conflicting information providers on the web, IEEE Transactions on Knowledge and Data Engineering 20(6) (2008), 796–808.
24. Zhao B. Rubinstein B.I.P. Gemmell J. and Han J., A bayesian approach to discovering truth from conflicting sources for data integration, Very Large Data Bases 5(6) (2012), 550–561.


Published In

Article first published online: May 1, 2019
Issue published: May 2019

Keywords

  1. Truth discovery
  2. local linear regression
  3. kernel density estimation
  4. outlier data
  5. source reliability

Rights and permissions

© 2019 – IOS Press and the authors. All rights reserved.

Authors

Affiliations

Zhiqiang Zhang1
National Meteorological Information Centre, Beijing 100081, China
Yichao Hu1
The College of Information Engineering, Xiangtan University, Hunan 411105, China
Songtao Ye*
The College of Information Engineering, Xiangtan University, Hunan 411105, China
Binbin Nie
Division of Nuclear Technology and Applications, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
Sen Tian
The College of Information Engineering, Xiangtan University, Hunan 411105, China

Notes

1
These authors contributed equally to this work.
*
Corresponding author: Songtao Ye, The College of Information Engineering, Xiangtan University, Hunan 411105, China. E-mail: [email protected].

