Enabling social web for IoT inducing ontologies from social tagging

Semantic domain ontologies are increasingly seen as the key to interoperability across heterogeneous systems and sensor-based applications. The ontologies deployed in these systems and applications are developed by restricted groups of domain experts rather than by semantic web experts. Lately, folksonomies have been increasingly exploited in ontology development. The “collective intelligence” that emerges from collaborative tagging can be seen as an alternative to the current effort on semantic web ontologies. However, the uncontrolled nature of social tagging systems leads to many kinds of noisy annotations, such as misspellings, imprecision and ambiguity. Thus, the construction of formal ontologies from social tagging data remains a real challenge. Most research has focused on discovering relatedness between tags rather than on producing ontologies, much less domain ontologies. This paper proposes an algorithm that uses tags from social tagging systems to automatically generate up-to-date domain-specific ontologies. An evaluation of the algorithm on a dataset extracted from BibSonomy demonstrated that it can effectively learn a domain terminology and identify meaningful semantic information for that terminology. Furthermore, the proposed algorithm introduces a simple and effective method for disambiguating tags.


INTRODUCTION
Semantic domain ontologies are increasingly seen as a key factor in the automation of information processing. Recently, semantic web technologies have been integrated into the Internet of Things (IoT), where ontologies play an essential role in integrating IoT data with web information systems. Applying such ontologies to IoT would better enable "things" to work in co-operation and to interact autonomously [1][2][3][4][5][6]. However, ontology development by domain experts is a time-consuming and expensive process. Moreover, the ontologies deployed in current sensor-based applications are developed by restricted groups of domain experts rather than by semantic web experts. In this context, social tagging data contributed by millions of online users represent an essential and continuous source of "collective intelligence", which is increasingly seen as an alternative to the current effort on semantic web ontologies [7][8][9]. Ontologies derived from folksonomies can give a machine-processable form to social tagging data, representing the collective intelligence of online communities rather than the perception of a limited group of experts. As such, they can capture changes arising from a more diverse user population, and would therefore become semantically richer and more useful for logical reasoning tasks [10]. Unfortunately, social tagging systems share the problems inherent to all uncontrolled vocabularies, such as ambiguity, synonymy, and the lack of hierarchy. Thus, knowledge extraction from social tagging data remains an unsolved challenge. In this paper, we introduce an algorithm for inducing a domain ontology from social tagging data. Experimental results on a snapshot of a BibSonomy dataset showed that the algorithm can effectively capture domain-specific concepts and enrich them with semantic information extracted from Wikipedia.

SOCIAL TAGGING SYSTEMS
Social tagging websites enable users to assign freely chosen tags to categorize their digital content (such as websites, pictures, and videos) on the Web, forming so-called folksonomies [11]. Currently, many web-based services foster the concept of tagging. These systems can be differentiated according to the kind of resources they support: for instance, Delicious for sharing bookmarks, Flickr for photos, BibSonomy for publications and bookmarks, and YouTube for videos. The basic principle of these services is to allow registered users to generate content and classify it in their own way by assigning arbitrary tags to it. Researchers attribute the success of tagging to the fact that no specific prior knowledge is required to tag, and to its immediate benefit [12], [13]. From a knowledge organization point of view, folksonomies have two main advantages: they provide a vast number of user-generated annotations that directly reflect users' vocabularies and interests, and they are relatively cheap to develop and harvest, as they emerge from end users' tagging [12][13][14]. These advantages have turned social tagging systems into an interesting data source for Semantic Web applications [7], [8], [14], [15].

RELATED WORKS
Much work has been done to introduce semantics into folksonomies [16][17][18][19] and to investigate methods of deploying these semantics for tasks such as information retrieval [20][21][22], recommender systems [23][24][25][26], and ontology development [27][28][29]. Likewise, a considerable number of studies have sought to extract structured knowledge and develop ontologies from social tagging systems. Early studies explored means of leveraging the co-occurrence statistics of tags and the tripartite structure of folksonomies to measure tag relatedness (e.g., [30][31][32][33]). More recently, researchers (e.g., [28], [34], [35]) proposed making tag semantics explicit by grounding tags to corresponding entries in online knowledge bases, such as WordNet and DBpedia. Although these approaches are more precise [36], those that depend heavily on WordNet achieve poor recall because many folksonomy tags do not exist in WordNet. In general, there is a lack of methods that extract domain-specific ontologies from folksonomies. Our algorithm produces baseline domain ontologies from folksonomy tags. It collects domain-relevant terms from tags, relying on a set of domain keywords extracted automatically from Wikipedia page titles. Then, it identifies the exact meaning of each term and retrieves semantic information about it.

INDUCING DOMAIN ONTOLOGY
Our algorithm takes the name of a specific domain and a prepared folksonomy dataset as inputs and produces a corresponding domain terminology as output. The algorithm first represents the folksonomy resources as an undirected weighted graph. Next, it collects a domain terminology by traversing the resources graph, relying on a set of domain keywords extracted automatically from the titles of Wikipedia entries. Finally, it extracts semantic information about the collected domain terms by linking them to their appropriate Wikipedia entries. This includes identifying the intended meaning, attributes and synonyms of each term. The general method is shown in Figure 1.

Pre-processing
Pre-processing is an important task, as it guarantees the quality of the data on which the rest of the process is carried out. This activity includes deleting special characters, duplicated tags and prepositions. Furthermore, we used a lexical vector to exclude non-objective tags, which cause noisy connections between resources in the resources graph [17], [23].
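The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' implementation: the preposition list and the `SUBJECTIVE_TAGS` set (a stand-in for the lexical vector of non-objective tags) are hypothetical placeholders, and the function name `clean_tags` is ours.

```python
import re

# Hypothetical stop list; the paper does not enumerate the prepositions it removes.
PREPOSITIONS = {"of", "in", "on", "for", "to", "with", "at", "by", "from"}
# Hypothetical stand-in for the paper's lexical vector of non-objective tags.
SUBJECTIVE_TAGS = {"toread", "cool", "todo", "myown", "interesting"}


def clean_tags(raw_tags):
    """Normalise the tags of one resource.

    Lower-cases each tag, strips special characters, and drops prepositions,
    non-objective tags and duplicates while preserving first-seen order.
    """
    seen = set()
    cleaned = []
    for tag in raw_tags:
        # Keep only ASCII letters and digits (an assumption of this sketch).
        tag = re.sub(r"[^a-z0-9]+", "", tag.lower())
        if not tag or tag in PREPOSITIONS or tag in SUBJECTIVE_TAGS:
            continue
        if tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return cleaned
```

Note that keeping only ASCII letters and digits would discard tags written in other scripts; a real pipeline would likely need a more careful normalisation.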

Resources graph generation
A folksonomy can be seen as a set of tag assignments A ⊆ U × T × R, where U, T, and R are finite sets whose elements are called users, tags and resources, respectively. A folksonomy can be represented as an undirected tripartite hypergraph G = (V, E), where V = U ∪ T ∪ R is the set of vertices and E = {(u, t, r) | (u, t, r) ∈ A} is the set of edges; the tripartite graph can be folded into two-mode and one-mode graphs [7], [37]. In this work, we adopt this definition using the one-mode graph G' = (V', E'), in which V' is the set of resources and E' is the set of weighted edges. Two resources (ri, rj) are connected if they share at least one tag. The following section describes how this graph is traversed in order to collect the relevant domain terms. To implement this phase, we use the JGraphT library, a free Java class library that provides mathematical graph-theory objects and algorithms (http://jgrapht.org/).
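As an illustration of the folding step, the sketch below builds the one-mode resources graph from (user, tag, resource) triples. It uses plain Python dictionaries instead of JGraphT, and it assumes that an edge weight is the number of distinct tags two resources share; the text above does not state how the weights are defined, so that choice, like the name `build_resource_graph`, is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def build_resource_graph(assignments):
    """Fold (user, tag, resource) triples into a one-mode resource graph.

    Returns (tags_of, graph), where tags_of[r] is the tag set of resource r
    and graph[r1][r2] is the assumed edge weight: the number of distinct
    tags shared by r1 and r2. Resources without shared tags stay isolated.
    """
    tags_of = defaultdict(set)
    for _user, tag, resource in assignments:
        tags_of[resource].add(tag)

    # Inverted index (tag -> resources), so that only resources that
    # actually share a tag are ever paired.
    resources_with = defaultdict(set)
    for resource, tags in tags_of.items():
        for tag in tags:
            resources_with[tag].add(resource)

    graph = {resource: {} for resource in tags_of}
    for resources in resources_with.values():
        for r1, r2 in combinations(sorted(resources), 2):
            graph[r1][r2] = graph[r1].get(r2, 0) + 1
            graph[r2][r1] = graph[r2].get(r1, 0) + 1
    return dict(tags_of), graph
```

Each shared tag contributes one unit to the weight of an edge, so two resources are connected exactly when they share at least one tag, as in the definition above.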

Domain Terminology Collection
In this activity, our algorithm starts by extracting a list of Domain Keywords from the titles of the Wikipedia articles and redirect pages contained in the main Wikipedia category corresponding to the domain at hand. Based on this list, we select a set of resources as starting points (SPs), or seeds, for traversing the graph. For a resource to be marked as a starting point, at least two-thirds of the tags assigned to it must appear in the Domain Keywords. Next, the algorithm traverses the graph G' repeatedly, starting from the SPs, looking for resources that are relevant to the domain at hand. The tags associated with the returned resources are collected as domain terms. In more detail, throughout the traversal we apply a ranking function to each visited vertex. The ranking function rates the relevance of a vertex to the given domain based on the number and weight of the paths that reach it from the different seeds (see Figure (1), adapted from [28]). Resources whose ranking value exceeds a defined threshold h are marked as domain-relevant resources, and all their associated tags are gathered as domain-relevant terms. To traverse the graph, we use breadth-first search (BFS); once a traversal has started from a particular seed, it stops when it reaches either another seed or a terminal vertex.

$$\mathrm{rank}(r) = \frac{\left|\{t \in T \mid (u, t, r') \in A\} \cap \{t \in T \mid (u, t, r) \in A\}\right|}{\left|\{t \in T \mid (u, t, r) \in A\}\right|} \ast \ldots$$

where r' is the previously visited vertex from which we reached r, and d is the distance between the current vertex r and the seed.
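The seed selection and traversal can be sketched in Python as below. The two-thirds rule and the BFS stopping condition follow the description above, but the ranking itself is only an assumed stand-in: each step contributes the tag-overlap ratio between the previous and the current vertex divided by the distance d from the seed, and contributions from different seeds are summed. The function names, the inclusion of the seeds' own tags in the output, and the reading that a traversal does not expand past another seed are likewise assumptions of this sketch.

```python
from collections import deque


def select_seeds(tags_of, domain_keywords):
    """Resources with at least two-thirds of their tags in the keyword list."""
    return {
        resource
        for resource, tags in tags_of.items()
        if tags and 3 * len(tags & domain_keywords) >= 2 * len(tags)
    }


def rank_resources(graph, tags_of, seeds):
    """Run one BFS per seed and accumulate a score for each non-seed vertex.

    Assumed score contribution per visit: |T(prev) & T(curr)| / |T(curr)| / d,
    where d is the BFS distance of curr from the seed.
    """
    scores = {}
    for seed in seeds:
        visited = {seed}
        queue = deque([(seed, 0)])
        while queue:
            prev, dist = queue.popleft()
            for curr in graph[prev]:
                if curr in visited:
                    continue
                visited.add(curr)
                if curr in seeds:
                    continue  # the traversal stops at another seed
                d = dist + 1
                overlap = len(tags_of[prev] & tags_of[curr]) / len(tags_of[curr])
                scores[curr] = scores.get(curr, 0.0) + overlap / d
                queue.append((curr, d))
    return scores


def collect_terms(tags_of, seeds, scores, threshold):
    """Gather the tags of the seeds (an assumption) and of every resource
    whose score exceeds the threshold h."""
    relevant = set(seeds) | {r for r, s in scores.items() if s > threshold}
    terms = set()
    for resource in relevant:
        terms |= tags_of[resource]
    return terms
```

A traversal ends naturally at terminal vertices because they have no unvisited neighbours left to enqueue.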

Concepts Identification
By concept identification, we mean identifying, for each term, the Wikipedia article that represents its intended meaning, so that we can standardize the names of the terms and enrich them with their categories and possible synonyms (see the example depicted in Figure 2). This activity also includes disambiguating terms and extracting semantic information about them. The advantage of using Wikipedia as a reference for mapping terms is that Wikipedia is a community-driven knowledge base, much like folksonomies, so it rapidly adapts to accommodate new terminology. Many of the popular tags occurring in folksonomies do not appear in lexical dictionaries such as WordNet, because they are proper nouns, modern technical words, or widely used acronyms. In addition, the redirect pages in Wikipedia provide synonyms and morphological variations for a concept. For example, searching for the tag 'nyc' in Wikipedia returns the entry for New York City. To perform this task, we used Google as an intermediary to retrieve the appropriate Wikipedia article for each term. First, we passed to Google a query in which the term is enclosed between the domain name (in this example, "Web Development") as context and the word "Wikipedia", which brings Wikipedia pages to the top. Then, we looked for a morphological match between the term and the titles of the top four retrieved Wikipedia pages. The simplest case occurs when a term can be matched directly to the first Google result. In other cases, a term may be matched to a page title, to a part of a title, or to one of the redirect pages. Terms may also be matched to abbreviations that appear in parentheses in the titles of Wikipedia entries. In some cases, matching to Wikipedia entries fails.
Querying Wikipedia through Google takes advantage of techniques embedded in Google, such as stemming and lemmatization, which gives a high chance of finding the correct Wikipedia articles. As shown in Figure 2, passing the term 'CSS' to Google retrieves the Wikipedia article entitled 'Cascading Style Sheets', since CSS is a redirect page to this article in Wikipedia. In the case of ambiguous terms (for instance, the term "Ajax", which could refer to a web development technique or to a hero of Greek mythology), the Wikipedia article that represents the intended meaning comes first in the Google results because the domain name is used as context in the search query. We then use the information available in the returned Wikipedia articles to enrich the terms. This includes the redirect pages, which serve as alternative names, and the Wikipedia categories containing the page, which are listed at the bottom of each article.
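The matching step over the top four results can be sketched as follows. The sketch deliberately leaves out the search itself: it assumes the ranked results are already available as (title, redirects) pairs, and it approximates "morphological matching" by case-insensitive comparison after stripping punctuation. The function names and the order in which the match types are tried are our assumptions.

```python
import re


def normalise(text):
    """Lower-case the text and collapse punctuation into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_term(term, candidates, top_n=4):
    """Return the title of the first of the top_n candidates that matches term.

    candidates is a ranked list of (title, redirects) pairs. A candidate
    matches when the term equals its title, the parenthesised part of its
    title, one of its redirect names, or a single word of its title.
    Returns None when no candidate matches.
    """
    wanted = normalise(term)
    for title, redirects in candidates[:top_n]:
        # Split "Main title (qualifier)" into its two parts, if present.
        parts = re.match(r"^(.*?)\s*\(([^)]*)\)\s*$", title)
        main, paren = (parts.group(1), parts.group(2)) if parts else (title, "")
        if wanted == normalise(main):
            return title
        if paren and wanted == normalise(paren):
            return title
        if wanted in {normalise(name) for name in redirects}:
            return title
        if wanted in normalise(main).split():
            return title
    return None
```

With a candidate list such as `[("Cascading Style Sheets", ["CSS"])]`, the term 'CSS' is resolved through the redirect name, which mirrors the Figure 2 example discussed above.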

DISCUSSION
The lack of evaluation frameworks and the absence or incompleteness of electronic resources that can be used as a gold standard make evaluating a terminology difficult [17], [38]. Besides, folksonomy tags are uncontrolled vocabularies that contain many slang words and abbreviations, while electronic resources often use formal and compound terms. The experiments were performed on a dataset captured from BibSonomy [39], composed of 20,000 resources annotated with 85,006 tags (11,865 unique tags). Three domains of computer science were selected randomly for the experiments: Semantic Web, Computer Networks, and Web Development. To evaluate the obtained terminology, we used majority voting among five researchers, who were asked to judge the domain relevancy of every obtained term (how strongly the term is relevant to the given domain) by labelling it "relevant", "irrelevant", or "uncertain". Table 1 shows the results we obtained: the "Distinct Terms" column gives all obtained terms after removing duplicates, and the "Relevant Terms" column gives the terms marked as domain-relevant. We calculated the precision of the obtained results as Precision = |relevant| * 100 / |distinct terms|, where distinct terms refers to all the unique terms we obtained. Formally, distinct terms = relevant ∪ irrelevant ∪ uncertain, where "relevant" refers to the terms marked as domain-relevant, "irrelevant" to the terms marked as not domain-relevant, and "uncertain" to terms whose relevance is unclear.
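The evaluation measure can be illustrated with a short Python sketch. The precision formula is the one given above; the handling of split votes is an assumption, since the text does not say what happens when no label obtains a majority: here such terms are counted as "uncertain". The judgement data in the usage note are invented purely for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return the label given by a strict majority of judges.

    When no label is chosen by more than half of the judges, the term is
    treated as "uncertain" (an assumption of this sketch).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "uncertain"


def precision(judgements):
    """Precision = |relevant| * 100 / |distinct terms|.

    judgements maps each distinct term to the list of labels it received.
    """
    labels = [majority_label(votes) for votes in judgements.values()]
    relevant = sum(1 for label in labels if label == "relevant")
    return relevant * 100 / len(labels)
```

For example, with four distinct terms of which two receive a majority of "relevant" votes, `precision` returns 50.0.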

[…] Wikipedia entities, as Google can recognize word morphology. Nevertheless, some terms could not be linked to a Wikipedia article because their context was incomplete (e.g., the term "usability" cannot be linked, but "web usability" can be). In other cases, terms cannot be matched because of the variant structures of compound terms. For instance, matching the term "DHTML" to the article labelled "Dynamic Html" fails although they refer to the same concept. Some terms, such as W3C and XML, could be considered relevant to several domains. As noted earlier, generating the domain keywords plays an important role in obtaining good recall. In this context, redirect pages in Wikipedia serve as good domain keywords, as folksonomies contain many neologisms and acronyms. However, to generate concepts for another domain with our algorithm, domain experts may need to be involved in selecting a suitable dataset for that domain. Users utilize folksonomies with various intentions. For instance, Delicious is used for general purposes, whereas BibSonomy primarily serves academic and scientific interests. Compared to a general folksonomy, an academic folksonomy has a more complex nature in terms of semantics and sparsity of the data [40][41][42]. Therefore, academic folksonomies would be more useful for building ontologies, particularly ontologies for scientific domains, whereas general folksonomies would be more suitable for extracting concepts of general domains such as Movies or Transport. This issue may be considered in our future work, together with developing a method that looks up terms that cannot be mapped to Wikipedia in other online knowledge sources.

CONCLUSION
Tag-based systems have become widely available thanks to their advantages, which include self-organization, currency, and ease of use. The bottom-up nature of these systems has proved to be an interesting knowledge source, since they provide a rich terminology generated by potentially large user communities. This paper addressed the problem of how to harvest and exploit the semantics embedded in social tagging systems for developing semantic ontologies. An evaluation of the algorithm on a dataset extracted from BibSonomy demonstrated that it can effectively learn domain ontology concepts and identify meaningful semantic relations for the extracted concepts. Furthermore, the proposed algorithm can help reduce common problems related to tag ambiguity and synonymous tags.