Domain Terminology Collection for Semantic Interpretation of Sensor Network Data

Many studies have investigated the management of data delivered over sensor networks and attempted to standardize the relations among such data. Sensor data come from numerous tangible and intangible sources, and existing work has focused on the integration and management of the sensor data itself. However, the data should be interpreted according to the sensor environment and related objects, even when the data type, and even the value, are exactly the same. This means that the sensor data should have semantic connections with all objects, and so a knowledge base that covers all domains should be constructed. In this paper, we suggest a method of domain terminology collection based on Wikipedia category information in order to prepare seed data for such knowledge bases. However, Wikipedia has two weaknesses, namely, loops and unreasonable generalizations in its category structure. To overcome these weaknesses, we utilize a horizontal bootstrapping method for category searches and domain-term collection. Both the category-article and article-link relations defined in Wikipedia are employed as terminology indicators, and we use a new measure to calculate the similarity between categories. By evaluating various aspects of the proposed approach, we show that it outperforms the baseline method, having wider coverage and higher precision. The collected domain terminologies can assist the construction of domain knowledge bases for the semantic interpretation of sensor data.


Introduction
Many studies have considered the integrated management of data received from sensor networks [1,2]. In particular, some significant research has focused on ontology-based approaches for developing standardized and semantic relations between the data [3-6]. The data collected from sensors represent various tangible and intangible objects, such as temperature, acceleration, GPS, light, barometric pressure, magnetic degree, and acoustic measurements. Existing research deals with the integration of the sensor data itself, the definition of standard schemes, and management applications for understanding the sensor data. However, the data could be interpreted differently according to the environment and which objects are related to the sensor, even though the data type, and even its value, may be the same. For example, two 1°C measurements from a refrigerator and an aquarium will have very different interpretations. To make appropriate decisions in different situations, the conceptual idea of the sensor network domain should be related to other concepts in different domains. To address this issue, knowledge bases incorporating ontology, taxonomy, folksonomy, or thesaurus information should first be constructed, allowing reliable connections to be formed between concepts of the sensor network and concepts of other domain knowledge bases. The fundamental step in constructing knowledge bases is to collect domain terminologies, and our research deals with a domain-term collection method. Domain-terms, which are the main components of the knowledge, are words and compound words that have specific meanings in a specific context (definition of the term "Terminology": http://en.wikipedia.org/wiki/Terminology). Constructing knowledge bases manually requires considerable labor, cost, and time and can sometimes result in conflict [7,8]. Therefore, the automatic construction of a body of knowledge by extracting domain-terms from various sources is a popular area of research [8-15].
Nowadays, Wikipedia (WP) and similar repositories are widely employed as information sources [7,16,17]. WP explains concepts in diverse forms (hereafter, we use concept, term, and article with the same meaning), such as abstract information (specific and long definitions), tabular information, the main article content, article links, and category information. The term "article" is generally used in WP, but it also means "title of article"; in this paper, we use "article" and "term" interchangeably. Moreover, WP provides highly reliable and widely used content, because it is based on semantic information from the collective intelligence of contributors worldwide. However, WP has a couple of weaknesses in its category structure (we detail these with examples in Section 2). One is that it has loops in the category hierarchy, and the other is that a significant number of categories are unreasonably generalized. These weaknesses were similarly identified in previous work [7]. General methods of extracting domain-terms from knowledge bases, such as Princeton WordNet [18], use a vertical search (top-down or bottom-up) that chooses a representative term (e.g., science) covering a field of interest and extracts as domain-terms all of the terms (e.g., natural science, life science, biology, botany, ecology, genetic science, morphology, anatomy, biomedical science, medical science, information science, and natural language processing) contained under the representative term [19]. Because of the weaknesses identified above, such methods cannot be applied to WP. This research proposes a horizontal method to resolve the difficulties of a vertical search. The method requires one domain category as input (multiple categories are possible, but we consider only the single case in this paper). The entry category contains many articles. We call these domain articles, and each domain article is involved in one or more categories. We consider the categories connected to the domain articles as candidate categories that can be deeply related to the entry category. Then, our method measures the similarity between the domain category and each candidate category. If the similarity matches or exceeds a predetermined threshold, the candidate and its articles are added to the domain category group and the domain article group, respectively. The method generates a similar category group and a domain terminology group through iterative processes, and we evaluate its category grouping and term collection performance.
The remainder of this paper is organized as follows. Section 2 describes our motivation for this research. Section 3 proposes the domain-term collection method through domain category grouping. Section 4 presents experimental results and evaluates the performance of the proposed approach, and finally, Section 5 summarizes our research.

Motivation
Many applications employ various WP components for semantic information processing. WP is an agglomeration of knowledge cultivated by contributors from diverse fields; thus, its content has wide coverage and high reliability. In particular, the hierarchical structure of categories and the semantic networks between articles resemble the human knowledge system. These strengths allow WP to be widely used; unfortunately, however, additional processing is needed. We identify a couple of weaknesses of WP in this section. Box 1 shows a case of loop relations in the hierarchical structure, which represents one of these weaknesses.
The category "Natural language processing" has "Concepts" as a supercategory, and each concept has itself as one of its superconcepts a few steps later. The WP hierarchy contains many such loop cases, and this poses difficulties during a vertical search. Even if the loops were resolved programmatically, there would be another obstacle, as shown in Box 2.
Box 2 enumerates the supercategories of "Natural language processing" after its loop cases have been removed. The initial category is a computer science technology, but it soon becomes connected to "Mind," "Marxism," "Humans," "Taxonomy," "Classification systems," "Libraries," "Collective intelligence," "Internet," "World," and "People." Some of the connections are appropriate, but others suffer from excessive generalization between categories. We call this inappropriate generalization, and it causes undesirable categories and terms to be collected into a domain category during a vertical search. Therefore, we propose a method of searching horizontally for related categories.
This research considers a category of interest as the entry domain category and measures the similarity of article intersections between this domain category and each candidate category. If the similarity equals or exceeds a predetermined threshold, the candidate is used as an element of the domain set. Some well-known measures of the degree of article intersection, such as the Jaccard similarity coefficient (JSC) and the Dice coefficient (DC), have significant limitations, as shown in Table 1.
Each case consists of two categories, and we wish to determine whether the candidate can be added to the domain set. In the first case, the categories have the same number of articles, 50 of which are shared as an intersection set.

Box 1: Case of a loop in the WP category hierarchy (bold represents the same category concept occurring iteratively in the hierarchy): Natural language processing → Computational linguistics → Natural language and computing → Human-computer interaction → Artificial intelligence → Futurology → Social change → Social philosophy → Human sciences → Interdisciplinary fields

Box 2: Case of inappropriate generalization in the WP category hierarchy: Natural language processing → Computational linguistics → Natural language and computing

These processes use the similarity measurements mentioned in the previous section; we now describe them in detail using real examples.
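The limitation of symmetric intersection measures can be made concrete with a small computation. The sketch below is illustrative only: Table 1 states that 50 articles are shared in the first case, but the category sizes used here are assumptions, not the paper's exact figures.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# Assumed Case 1: two categories of 100 articles each, sharing 50.
domain = set(range(100))
candidate = set(range(50, 150))

# Assumed Case 2: a small candidate wholly contained in the domain.
small_candidate = set(range(50))

print(jaccard(domain, candidate), dice(domain, candidate))
print(jaccard(domain, small_candidate), dice(domain, small_candidate))
```

Both measures are symmetric in the two sets, so a tiny candidate fully contained in a large domain can score higher than a balanced overlap, even though such containment may signal the generalization problem; this inability to weigh the two sides differently is the kind of limitation the new measure in Section 3 addresses.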

System Flow.
The proposed method takes one category, which the user selects as an entry (trigger), and follows the flowchart shown in Figure 1. Starting from the entry category, we determine similar categories through a horizontal search.
In the search, articles included in the entry act as "clues" for measuring the similarity and as "bridges" for preparing the next candidate categories. Figure 2 illustrates an example of a category-article network. If the category "Natural language processing" is given as the entry, the method finds articles for the similarity measurement and prepares the categories of each article as the next candidates. This means that "Information science," "Knowledge representation," "Machine learning," "Artificial intelligence applications," "Data mining," and so forth are processed individually as candidate categories. We now explain the process shown in Figure 1 using similar examples.
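This candidate-preparation step can be sketched as follows. The data model is an assumption for illustration: the WP category-article relation is represented as two plain dictionaries, one mapping a category to its set of articles and one mapping an article to its set of categories.

```python
def candidate_categories(entry, articles_of, categories_of):
    """Collect candidate categories for a horizontal search:
    every category, other than the entry itself, that is attached
    to an article of the entry category."""
    candidates = set()
    for article in articles_of.get(entry, set()):
        for category in categories_of.get(article, set()):
            if category != entry:
                candidates.add(category)
    return candidates

# Toy category-article network, loosely modeled on Figure 2
# (hypothetical data, not the actual WP relations).
articles_of = {
    "Natural language processing": {"Text mining", "Machine translation"},
}
categories_of = {
    "Text mining": {"Natural language processing", "Data mining"},
    "Machine translation": {"Natural language processing",
                            "Artificial intelligence applications"},
}

print(candidate_categories("Natural language processing",
                           articles_of, categories_of))
```

Each article thus acts as a "bridge": the categories it belongs to, other than the entry, become the candidates whose similarity to the domain is measured next.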

Domain-Term Selection through Category Grouping (Bootstrapping Method).
To group similar categories, we choose a horizontal category search and propose new similarity measurements that enable the group to be enriched. The bootstrapping process proceeds as follows.
(1) An initial domain category (DC) consists of a user-selected category: DC = {user selected category}. For example, DC = {Natural language processing}. The size of DC increases throughout the iterative process.
(3) There are two options to choose whether an article-link network is used in the similarity measurements. We explain the options using Figure 3 and Table 2.
Figure 3 shows the network between the categories "Natural language processing" and "Data mining," whereas Table 2 defines the network types. In the similarity measure (1), we basically assign a value of 0.5 to the weighting factor. Based on (1), we can calculate the similarities of Cases 1 and 2 in Table 1 to obtain values of 0.5 and 0.54, respectively. However, we must consider an additional constraint in the bootstrapping method. Table 3 shows another example: based on (1), both cases have the same similarity value, as shown in Table 3. Even so, Case 2 is inappropriate, because the coverage of DA is too narrow. This may cause the generalization problem. Thus, before calculating similarities in the bootstrapping method, the similarity constraint should be satisfied.
According to this constraint and (1), the similarity between "Natural language processing" and "Data mining" is calculated. In the distance-based measure (3), distance is the number of articles that exist on the network (this is different from dist in DA).
According to (3), we can calculate sim(DLS, CA), sim(DA, CLS), and sim(DLS, CLS), which have distances of 2, 2, and 3, respectively. The final similarity (final_sim) between DA and CA is determined by summing sim(DA, CA) from (1) with sim(DLS, CA), sim(DA, CLS), and sim(DLS, CLS) from (3); the similarity between "Natural language processing" and "Data mining" is calculated in this way. We have described all of the bootstrapping steps for domain-term selection by grouping similar categories. In the next section, a few aspects of performance are evaluated.
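The summation of the four similarity terms can be sketched as follows. This is an assumed instantiation, not the authors' exact formulas: `overlap_sim` stands in for measure (1) with the 0.5 weighting factor mentioned above, and the distance divisors 2, 2, and 3 reflect the distances stated for the link-based terms in (3).

```python
def overlap_sim(set_a, set_b, delta=0.5):
    """Assumed weighted-overlap measure in the spirit of (1):
    the intersection size is related to each set's size,
    weighted by delta (basically 0.5)."""
    if not set_a or not set_b:
        return 0.0
    inter = len(set_a & set_b)
    return delta * inter / len(set_a) + (1 - delta) * inter / len(set_b)

def final_sim(DA, CA, DLS, CLS):
    """final_sim = sim(DA, CA) + sim(DLS, CA) + sim(DA, CLS) + sim(DLS, CLS),
    with the link-based terms discounted by their network distances
    (2, 2, and 3, respectively) as an assumed reading of (3)."""
    return (overlap_sim(DA, CA)
            + overlap_sim(DLS, CA) / 2
            + overlap_sim(DA, CLS) / 2
            + overlap_sim(DLS, CLS) / 3)
```

In the bootstrapping loop, a candidate category and its articles would be added to DC and DA when this combined value meets the predetermined threshold.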

Experimental Evaluations
This section considers the evaluation of DC and DA. One objective of our research is to select as many domain-terms as possible, for which we proposed the bootstrapping method for similar category grouping. In the process, DC and DA become enriched with categories and articles, respectively, and each article has supplementary values of dist, count, and dw.
To evaluate the quality of DC and DA, we used the article-category dataset and the Pagelinks dataset included among the WP components in DBpedia version 3.7 (DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web: http://dbpedia.org/About). To utilize the WP networks, we implemented two versions of our system, referred to as new similarity (NS) and new similarity with links (NSL). We varied the threshold of each system from 0.1 to 0.9. Moreover, we chose entry sets of 40 categories (each set has one category) from fields of computer science, such as "Natural language processing," "Speech recognition," and "Semantic Web." The results are compared with those of a baseline method that employs the DC similarity measure. Table 4 shows some of the similar categories collected by NSL for the entry category "Semantic Web" with a threshold of 0.2.
We invited domain specialists to examine the results, and each collection was manually checked by each evaluator. The RE field in Table 4 shows the actual checked results, where values of 1 and 0 were assigned by the evaluators for relevance and irrelevance, respectively. Tables 5 and 6 present summaries of the DC evaluations.
There was no improvement for thresholds greater than 0.6, and the bootstrapping was incomplete for a threshold of 0.1. Therefore, we present results for threshold values from 0.2 to 0.6. In the experiments, the baseline attained an extension rate of only 1.18 (40 categories extended by 7 categories), even though its precision was 100%. The aim of information processing is to reduce the time taken to accomplish certain objectives, which implies that the information system should provide varied results. In this respect, we do not expect the baseline results to be helpful. However, it is apparent that NSL provides wide extension and high precision. The maximum extension rate was 24.45, with 799 appropriate categories for a threshold of 0.2. The minimum precision was around 84%, when the threshold was 0.3. In addition to evaluating DC, we evaluated DA with the NSL results for a threshold of 0.2. To examine the influence of the distance, count, and domain weight, we analyzed the results according to each factor. Six DAs were selected at random, with a total of 1,769 articles (terms). Box 3 enumerates a part of the collected articles for the domain "Semantic Web": XHTML + RDFa, Bath Profile, Sidecar file, COinS, Metadata publishing, MARC standards, WizFolio, Qiqqa, ISO-TimeML, TimeML, Metadata Authority Description Schema, Bookends (software), RIS (file format), Metadata Object Description Schema, EndNote, Refer (software), ISO 2709, BibTeX, XML, S5 (file format), Semantic HTML, Simple HTML Ontology Extensions, Opera Show Format, XOXO, XHTML Friends Network, StrixDB, Graph Style Sheets, TriX (syntax), TriG (syntax), RDF feed, Redland RDF Application Framework, RDF query language, Turtle (syntax), RDFLib, Notation3, SPARQL, D3web, Artificial architecture, NetWeaver Developer, Knowledge engineer, Frame language, . . . Tables 7, 8, and 9 show the evaluation results with respect to distance, count, and weight.
The basic performance of the domain-term selection attained precision of 70.9%. As expected, the precision was inversely proportional to the distance; however, a distance of 4 produced almost all of the unrelated articles. The weight and count could be used as important criteria to select domain-terms; we found that the weight returned more refined results than the count (the weight returned 755 appropriate terms with 97.2% precision at a threshold of 0.3). At short distances, there are many names of people and organizations, such as "Squarespace," "Rackspace Cloud," and "Nsite Software (Platform as a Service)." These names were selected by the bootstrapping because their associated categories (e.g., "Cloud platforms," "Cloud storage," and "Cloud infrastructure") were similar to the entry (e.g., "Cloud computing"). This situation is not caused by our method, but by the definition of the article-category relations of WP. We believe that this can be resolved by processing content (abstracts) or tabular information in the future.

Conclusions and Future Work
This paper has proposed a method of domain-term collection through a bootstrapping process to assist the semantic interpretation of data from sensor networks. To achieve this, we identified weaknesses in the WP category hierarchy (i.e., loops and inappropriate generalizations) and chose a horizontal, rather than vertical, category search. We proposed new semantic similarity measurements and a similarity constraint to surpass existing methods. Moreover, we employed category-article networks and article-link networks to elicit information for the category similarity measurement. In performance evaluations, our category grouping based on NSL yielded the greatest number of proper results. In terms of domain-term selection, we confirmed that the results obtained with normalized weights had the best precision and extension rate. The distance-based metric had no positive influence on our research; when the distance was greater than three, almost all of the terms were unrelated. However, we believe that the collected domain terminologies can assist the construction of domain knowledge bases for the semantic interpretation of sensor data.
WP has weaknesses in addition to those mentioned in this paper, especially in the category-article relation. For example, the term "Paco Nathan" is a personal name that has "Natural language processing" as one of its categories. The relation between the two, that is, that "Paco Nathan" has expertise in "Natural language processing," causes noise and negatively influences semantic information processing. We think that this problem can be solved in future work by processing additional WP components, such as abstract or tabular information. Moreover, our research employed only the out-links of WP articles. If the in-links were also considered, we expect that the results would be more significant, with wider coverage of domain-terms and higher relevance.

Box 3: Domain articles (DA) collected for "Semantic Web" with NSL and threshold value 0.2.

Table 1: Similarity issues for bootstrapping methods.
(i) The first option considers only category-article networks, such as {Content determination, Information retrieval, Languageware, Concept mining, Document classification, Text mining, Automatic summarization, String kernel, Sentic computing, . . .} for "Natural language processing." Type 1 in Table 2 is related to this option.
(ii) The second option uses more complex networks that utilize the article-link network as well as the category-article network.

If there is an intersection article, then an intersection set IS is constructed to measure the similarity: IS(set_i, set_j) = set_i ∩ set_j ⇒ {art_k, 1 ≤ k ≤ n}.

Table 2: Network types and examples: C denotes a category; A, an article; and L, a link.

Table 3: Examples of similarity measurement.
(10) If there is at least one element in DA, we return to Step 2 for the new CC of the next domain article; otherwise, proceed to Step 19.
(11) To use the links of articles as additional clues for the similarity measurement, a domain link set (DLS) is collected: DLS(DA) = {link_dom_j, 1 ≤ j ≤ m}. For example, DLS = {Information, Metadata, Relational database, World Wide Web, Data (computing), Document retrieval, . . .}.
(12) This step is the same as Step 4.
(13) This step is the same as Step 5.
(14) Construct a candidate link set (CLS) with links from CA. Explicitly, CLS(CA(cat_k)) = {link_cat_t, 1 ≤ t ≤ q}. For example, CLS(CA(Data mining)) = {Computer science, Data set, Artificial intelligence, Database management system, Business intelligence, Neural network, Cluster analysis, . . .}.

(16) If the similarity exceeds the predetermined threshold, go to Step 17; otherwise, Step 18 is carried out.
(17) If the similarity exceeds the threshold, the system enriches DC, DA, and DLS. This step is similar to Step 8.
(18) If there is at least one element in CC, we return to Step 12 for the next candidate category; otherwise, go to Step 10.
(19) Output the DC and DA acquired through the bootstrapping process.
(20) Terminate the bootstrapping process and evaluate DC and DA, which are increased in Step 8 or 17. For DA, the evaluations are divided into three types by supplementary value (dist, count, and dw). In the case of domain weight, we use values normalized according to weight'(art_i) = weight(art_i) / max_{art_j ∈ DA} weight(art_j).
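The normalization in Step 20 divides each article's weight by the maximum weight over DA, so the highest-weighted article maps to 1.0. A minimal sketch (the dictionary representation of DA weights and the article names are assumptions for illustration):

```python
def normalize_weights(weights):
    """Normalize domain weights by the maximum weight over DA:
    weight'(art_i) = weight(art_i) / max over DA of weight(art_j)."""
    max_w = max(weights.values())
    return {art: w / max_w for art, w in weights.items()}

# Hypothetical raw weights for three domain articles.
print(normalize_weights({"SPARQL": 4.0, "Turtle (syntax)": 2.0, "XML": 1.0}))
```

Normalizing by the maximum keeps the relative ordering of articles while putting all domain weights on a common [0, 1] scale, which makes thresholds comparable across domains of different sizes.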

Table 4: Domain categories collected for "Semantic Web" with NSL and threshold value 0.2.

Table 5: DC evaluations with baseline and NS.

Table 7: DA evaluations based on distance, including articles within the indicated distance.

Table 9: DA evaluations based on normalized weight.