In this section, the LLR model is proposed to address the challenges caused by outliers. LLR performs an iterative procedure over both the source weights and the truth computation. An optimization algorithm is introduced at the end of the section.
2.1 Problem formulation
We first define several basic terms and then introduce the notations used in this paper with an example.
DEFINITION 1. An entity is an object of interest. A claimed value is a piece of information that a source provides for an entity.
DEFINITION 2. A truth is the most trustworthy claimed value of an entity.
DEFINITION 3. A source reliability score describes the probability that a source provides trustworthy claims. A higher source reliability score indicates that the source is more reliable and can provide more trustworthy information.
EXAMPLE 1. A weather database is shown in Table 1. In this example, Sources 1, 2, and 3 provide claimed values for the temperature of Beijing. Compared with the ground truths, certain sources provide accurate information, whereas other sources provide outliers that deviate significantly from the truth. Therefore, outliers may affect the accuracy of the truth estimation.
Our task is to determine the information that is closest to the truth from conflicting information. During the truth discovery procedure, the reliability score $w_i$ of the $i$-th source can be inferred from its claims. A higher value of $w_i$ implies greater reliability of the information provided by the source. The notations used in this paper are listed in Table 2.
2.2 Local linear regression model
In real-world applications, the existence of an objective truth cannot be ensured; multiple reliable facts can be obtained from uncertain information. Each uncertain claim can be transformed from a real value to a single component function, and then the distribution of these claimed values can be estimated as a probability distribution. The probability of a claimed value can be regarded as a reflection of the probability of that claim being identified as true. Then, the weights of the sources can be estimated from the probabilities of their claims. Next, we introduce the specific method for estimating the probability distribution.
We assume that a set of claimed values is provided by multiple sources for an entity and model these claimed values by probability density estimation. Let $f$ denote a probability density function, and consider the nonparametric estimation of $f$ based on a random sample $x_1, x_2, \dots, x_n$ from $f$. Then, the kernel density estimation of $f$ is given by

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h_i}(x - x_i), \tag{1}$$
where $K_{h_i}(\cdot) = \frac{1}{h_i}K\!\left(\frac{\cdot}{h_i}\right)$ is a kernel function with bandwidth $h_i$ ($h_i > 0$) for the $i$-th source [10, 11, 4, 19]. $\hat{f}(x)$ can be ensured to be a probability density if the kernel satisfies

$$K(x) \geq 0, \qquad \int_{-\infty}^{+\infty} K(x)\, dx = 1. \tag{2}$$
In this paper, $K(\cdot)$ is determined by a Gaussian kernel as

$$K(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right). \tag{3}$$
However, kernel density estimators are not consistent when estimating a density near the finite end points of the support of the density to be estimated, due to the boundary effects that occur in nonparametric curve estimation problems [10].
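To make Eqs. (1)-(3) concrete, the following is a minimal Python sketch of the kernel density estimate with a Gaussian kernel and per-source bandwidths; the function names and sample values are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel of Eq. (3): K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, claims, bandwidths):
    """Kernel density estimate of Eq. (1) at point x: the average of the
    kernels K_{h_i}(x - x_i), one per claimed value, each with its own h_i."""
    claims = np.asarray(claims, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    return np.mean(gaussian_kernel((x - claims) / h) / h)

# Illustrative claims for one entity; the third source is an outlier.
claims, h = [21.0, 21.5, 35.0], [1.0, 1.0, 1.0]
print(kde(21.2, claims, h))  # higher density near the cluster of claims
print(kde(35.0, claims, h))  # lower density near the isolated outlier
```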
Next, we define the local linear function and discuss the LLR model in detail. Let $m(x)$ be a continuous function of $x$ estimated from a set of claimed values $\{x_1, x_2, \dots, x_n\}$. The values $y_i = \hat{f}(x_i)$, $i = 1, \dots, n$, are the probability values of the observations at the points $x_1, \dots, x_n$. Similar to kernel regression estimation, this estimator can be interpreted as a weighted average of the observations. According to the theory of Nadaraya-Watson [17, 18], the observation is estimated as a locally weighted average using a kernel as a weighting function. The Nadaraya-Watson estimator can be defined as

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_{h_i}(x - x_i)\, y_i}{\sum_{i=1}^{n} K_{h_i}(x - x_i)}, \tag{4}$$

which is equivalent to the solution of the locally weighted least squares problem

$$\hat{m}(x) = \arg\min_{\theta} \sum_{i=1}^{n} K_{h_i}(x - x_i)\,(y_i - \theta)^2. \tag{5}$$
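As an illustration, the Nadaraya-Watson estimator of Eq. (4) reduces to a few lines of Python; this is a sketch with illustrative names, not code from the paper.

```python
import numpy as np

def nadaraya_watson(x, xs, ys, bandwidths):
    """Nadaraya-Watson estimate of Eq. (4): a locally weighted average of the
    observations ys at the points xs, with kernel weights K_{h_i}(x - x_i)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    k = np.exp(-0.5 * ((x - xs) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    return np.sum(k * ys) / np.sum(k)
```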
From Eqs. (4) and (5), the estimation of $\hat{m}(x)$ and the weighted least squares estimator for the local model are equivalent. However, outliers often cause a large disturbance in the calculation of the weighted average in Eq. (4); therefore, the result of Eq. (4) can also be affected by the boundary effect. Because local weighted averaging, rather than global weighted averaging, should be used to estimate the probability density value at a point, a local linear regression method is proposed in this paper. Compared with kernel density-based truth discovery, LLR narrows the interval of claimed values that may be affected by outliers.
The principle behind LLR-based truth discovery is as follows. The curve that describes the shape of the probability distribution of the claimed values can be fitted by a sequence of segmented lines. In the 2D space generated by the 2-tuple vector $(x_i, y_i)$, each segmented line can be determined by the function $y = a_0 + a_1(x_i - x)$, where $a_0$ and $a_1$ are two local parameters. Then, the procedure of local linear estimation can be regarded as the objective function described by Eq. (6). Consider the estimator $\hat{m}(x) = \hat{a}_0$ given by the solution to

$$\min_{a_0, a_1} \sum_{i=1}^{n} \left[ y_i - a_0 - a_1 (x_i - x) \right]^2 K_{h_i}(x_i - x). \tag{6}$$

Then, the estimate of the function $m$ can be achieved by using the method of least squares. That is, $\hat{m}(x)$ is the constant term $\hat{a}_0$ in the weighted least squares regression of the $y_i$ within the interval $i \in (1, n)$ with weights $v_i(x)$ for each point $x$,

$$v_i(x) = K_{h_i}(x_i - x)\left[ S_{n,2}(x) - (x_i - x)\, S_{n,1}(x) \right], \tag{7}$$

where $S_{n,j}(x) = \sum_{i=1}^{n} K_{h_i}(x_i - x)\,(x_i - x)^j$. Therefore, the local linear function can be defined as

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} v_i(x)\, y_i}{\sum_{i=1}^{n} v_i(x)}. \tag{8}$$
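A minimal sketch of the local linear estimator of Eqs. (6)-(8), using the closed-form weights $v_i(x)$ given above; as before, the names are illustrative.

```python
import numpy as np

def local_linear(x, xs, ys, bandwidths):
    """Local linear estimate m_hat(x) = a0_hat of Eq. (8): the closed-form
    solution of the weighted least squares problem in Eq. (6)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    u = xs - x
    k = np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    s1 = np.sum(k * u)       # S_{n,1}(x)
    s2 = np.sum(k * u ** 2)  # S_{n,2}(x)
    v = k * (s2 - u * s1)    # effective local weights v_i(x) of Eq. (7)
    return np.sum(v * ys) / np.sum(v)
```

Unlike the Nadaraya-Watson weights, the weights $v_i(x)$ adapt to the local slope of the data, which is what reduces the boundary effect near the end points of the support.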
From Eq. (8), we can obtain the probability density estimation $\hat{m}(x_i)$, which can be considered as an estimate of the reliability of the $i$-th source. The function transformation of $x_i$, $y_i = \hat{f}(x_i)$, is a density function with a Gaussian distribution. By considering source trustworthiness, the extended weighted sample mean function is expressed as

$$\hat{f}_{\mathcal{W}}(x) = \sum_{i=1}^{n} \frac{w_i}{\sum_{i'=1}^{n} w_{i'}}\, K_{h_i}(x - x_i), \tag{9}$$

where $w_i$ is the weight of the $i$-th source.
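Eq. (9) differs from Eq. (1) only in that each source's kernel is scaled by its normalized weight; a short sketch follows, again with illustrative names.

```python
import numpy as np

def weighted_kde(x, claims, bandwidths, weights):
    """Weighted kernel density estimate of Eq. (9): kernels of low-weight
    (unreliable) sources contribute less to the estimated density."""
    xs = np.asarray(claims, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    w = np.asarray(weights, dtype=float)
    k = np.exp(-0.5 * ((x - xs) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    return np.sum(w * k) / np.sum(w)
```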
The main concept of the proposed model is that sources with higher probability density values from the LLR provide more trustworthy claimed values. The variance of the error distribution can reflect the reliability of the $i$-th source. The weight $w_i$ must be examined in more detail because a higher weight indicates that the values provided by the source are closer to the truths. Therefore, it is important to reasonably assign the weights of the sources and thereby obtain the truth. Next, we introduce the loss function and then determine the rule for the source weight updates.
2.3 Source weight assignment
We must find a set of local linear functions $\{\hat{m}_j\}_{j=1}^{N}$ (one per entity) and a set of numbers $\{w_i\}_{i=1}^{K}$ that minimize the total loss function

$$\mathcal{L} = \sum_{j=1}^{N} \sum_{i=1}^{n_j} w_i\, d\big(\hat{m}_j(x_{ij}),\, \hat{m}_j(x_j^*)\big), \tag{10}$$

where $n_j$ is the number of provided claims for the $j$-th entity, and the source weights must satisfy

$$\sum_{i=1}^{K} M_i\, e^{-w_i} = 1, \tag{11}$$

where $M_i$ is the number of claims provided by $s_i$.
The minimized loss function expressed in Eq. (10) can be used to determine the rule of the source weight updates, where $w_i$ represents the reliability of the $i$-th source and $\hat{m}_j$ can be given by a specific weighted kernel density estimation $\hat{f}_{\mathcal{W}}$ based on Eq. (9). $d(\cdot, \cdot)$ is a measure of the distance between the probability density $\hat{m}_j(x_{ij})$ of a claimed value and the density $\hat{m}_j(x_j^*)$ of the identified truth. Therefore, by minimizing the loss function $\mathcal{L}$, we can search for the values of two sets of a priori unknown variables, $\mathcal{W} = \{w_i\}$ and $\{\hat{m}_j\}$, which correspond to the collection of source weights and the sources' reliability, respectively.
To minimize the total loss function under the constraint of Eq. (11), we use Lagrange multipliers to solve this optimization problem. The real number $\lambda$ is the Lagrange multiplier. Then, a new optimization function is generated as

$$L(\mathcal{W}, \lambda) = \sum_{i=1}^{K} w_i D_i + \lambda \left( \sum_{i=1}^{K} M_i\, e^{-w_i} - 1 \right), \qquad D_i = \sum_{j} d\big(\hat{m}_j(x_{ij}),\, \hat{m}_j(x_j^*)\big). \tag{12}$$

First, we focus on obtaining the set of numbers $w_i$ for $i = 1, \dots, K$. The solution can be converted into a global optimization problem by solving the equations $\partial L / \partial w_i = 0$ and $\partial L / \partial \lambda = 0$ and calculating $w_i$ for $i = 1, \dots, K$. Suppose that the truths $\{x_j^*\}$ are fixed. Then, setting $\partial L / \partial w_i = D_i - \lambda M_i e^{-w_i} = 0$ yields the optimal solution

$$w_i = \log\!\left( \frac{\lambda M_i}{D_i} \right), \qquad \lambda = \sum_{i'=1}^{K} D_{i'}, \tag{13}$$

where the value of $\lambda$ follows from substituting $w_i$ back into the constraint $\partial L / \partial \lambda = 0$.
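Assuming the constraint of Eq. (11) and the Lagrangian of Eq. (12), the optimal weights of Eq. (13) can be computed directly; the guard against zero losses is a practical detail of this sketch, not part of the paper.

```python
import numpy as np

def solve_weights(d, m):
    """Optimal source weights with the truths fixed (Eq. (13)):
    w_i = log(lambda * M_i / D_i), where D_i is the accumulated density
    distance of source s_i, M_i its claim count, and lambda = sum_i D_i."""
    d = np.maximum(np.asarray(d, dtype=float), 1e-12)  # avoid log of infinity
    m = np.asarray(m, dtype=float)
    lam = np.sum(d)  # Lagrange multiplier from dL/dlambda = 0
    return np.log(lam * m / d)
```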
In addition, truth discovery methods [12, 13, 21, 4, 23, 22, 8, 24, 6, 15] typically assume source consistency, which can be described as follows: a source is likely to provide trustworthy information with the same probability for all objects. This assumption is reasonable in many applications and is one of the most important assumptions for estimating source reliability [15]. Therefore, the source reliability $w_i$ can be initialized uniformly. Then, the rule of the weight updates can be defined as

$$w_i = \log\!\left( \frac{M_i \sum_{i'=1}^{K} D_{i'}}{D_i} \right). \tag{14}$$
The truth calculation procedure is described below, based on the source weights obtained from Eq. (14).
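For concreteness, the sketch below assembles the alternating procedure of this section: uniform weight initialization, a truth update, and the weight update of Eq. (14). Two simplifications are ours, not the paper's: a squared-error distance on the claimed values stands in for the density-based distance $d$, and a weighted sample mean stands in for the truth computation described below.

```python
import numpy as np

def llr_truth_discovery(claims, n_iter=10):
    """Alternating updates on a (sources x entities) array of claimed values:
    truths from the current weights, then weights from the current truths."""
    k, m = claims.shape
    w = np.ones(k)  # uniform initialization under the source consistency assumption
    for _ in range(n_iter):
        truths = (w @ claims) / np.sum(w)  # truth update: weighted sample mean
        d = np.maximum(np.sum((claims - truths) ** 2, axis=1), 1e-12)
        w = np.log(m * np.sum(d) / d)      # weight update in the form of Eq. (14)
    return truths, w

# Illustrative data: the third source injects outliers and is down-weighted.
claims = np.array([[21.0, 14.0, 30.0],
                   [21.5, 13.5, 30.5],
                   [35.0, 25.0, 45.0]])
truths, weights = llr_truth_discovery(claims)
print(truths)   # close to the claims of the two consistent sources
print(weights)  # the outlier source receives the smallest weight
```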