Meliorating usable document density for online event detection

ABSTRACT


INTRODUCTION
The last few years have seen a boom in the number of users on social media platforms and in the amount of data they produce. In 2017, there were a reported 288 million [1] social network users. The data produced by these users range from promotions to thought sharing, and sometimes to reports of an incident or event that the user experienced directly or indirectly. All of the information posted is subject to the user's own experience, understanding, research, and findings. Monitoring events over social streams has many applications, such as crisis management and decision making. Owing to human nature and psychology, this information is posted to social media sites as soon as an incident occurs, or even as it is occurring. In contrast to traditional sources of information and news (such as TV stations, radio stations, and newspapers), which must perform a background check on an incident before reporting it, a process that usually takes time, monitoring users' posts on social media may ensure not only that the information is available more quickly, but also that firsthand information can be obtained from a larger number of [1] social media participants. For instance, during the recent earthquake in Nepal, with local media information unavailable to them, the international media took some time to validate the rumors and move in to report them. At the same time, however, within a time window of a few hours, many tweets, Facebook statuses, and pins on Pinterest were being shared with firsthand reports of the events. If the information in these posts can be analyzed, a lot of firsthand information becomes available in a short time.
This property of social media reporting gave rise to a new area of research named online event detection (OED) [2]-[4], where a system tries to identify a possible event as quickly as possible from recent posts. Most of these systems work in three phases. First, they continuously accumulate documents (the user posts) from a particular social media platform. They then perform clustering based on an existing or proposed similarity metric. Finally, they flag the significant clusters as possible events. For the second phase, most of these systems try to group the user posts based on the textual and contextual similarities among the documents [2], [4]-[6]. This implies that the words used in these documents are of utmost importance for correctly identifying their association with a particular event.
A closer look at the popular social media platforms shows that one property common across platforms is the reply. This property can adversely affect the performance of such systems because reply documents, at times, tend to replace important words with pronouns. When a document replaces a particular word with a pronoun, adding the document does not increase the number of occurrences of that word in the cluster, and thus does not increase the strength of this word. Therefore, the impact of this document on the cluster in question is reduced.
This article focuses on this particular property of social media to increase the usability of such documents and thereby increase the density of usable documents for the purpose. The critical aspect here is to update these documents so that they are appropriately clustered into their relevant events. Existing articles ignore this issue and try to compensate with a larger number of documents, where a higher number of documents also implies added delay and a higher memory requirement. In this paper, we present a method to augment the density of the event clusters by performing pronoun resolution on each incoming document before applying OED algorithms. The primary contributions of this article are as follows: i) the proposed work performs a parts-of-speech-based pronoun resolution between the reply document that contains the set of pronouns and the main document to which it was posted; the resulting document can greatly affect the similarity measurement between the main and the reply document, as more similar words will be present; ii) there can be a hierarchy of main-reply documents, and a pronoun may not correlate with its direct parent; the approach therefore traverses the hierarchy to identify the most appropriate word for the pronoun in question; and iii) as the system aims to run in real time, searching a considerably large hierarchy of documents for antecedent words may significantly delay the outcome; therefore, we propose a selection-window-based antecedent search to reduce the possibility of unaccounted delay.
We use the Twitter platform for simulation purposes, as data from this platform are easily downloadable and many such datasets are available. The experimental results demonstrate that the dataset produced by the proposed approach significantly increases the performance of the existing OED algorithms compared to raw-data usage. The rest of the article is organized as follows. Section 2 discusses the background and related works on OED algorithms and pronoun resolution techniques. In section 3, we propose our information augmentation technique for pronoun resolution on streaming data. The results of the proposed algorithm are analyzed and discussed in section 4. Finally, section 5 concludes the paper and provides future directions in the area.

BACKGROUND AND RELATED WORK
The term 'event' carries a diverse range of meanings depending on the context, but the idea behind it remains unchanged. Any important, unfamiliar, or abnormal happening in a normal context can be treated as an event. It may be part of a chain of incidents, as the effect of a preceding incident or as the cause of succeeding ones. The process of discovering these happenings or incidents through social media data is termed 'online event detection'. An event detection technique [7], [8] can be either proactive, where the categories and properties of possible event types are known, or reactive, where the system assumes that an event type may be previously unknown. Primarily, any such system has four major processing phases: gathering raw data, preprocessing, term weight estimation, and grouping documents based on their similarity. Traditionally, events were detected by gathering [3] historical data such as news articles, radio broadcast articles, and classified columns. With the advent of social media streams, an enormous amount of real-time data has become obtainable [1]. At present, when an event occurs, it is, in the majority of cases, first posted on various social networking sites (SNS), and the news follows. This highlights the importance of social media streams in event detection, and as a result, SNS gained popularity as a prime source for event detection. Recent works include Dakle et al. [9] on coreference resolution for email-related text and Wright-Bettner et al. [10] on cross-document coreference, while the authors in [11] focus specifically on mission-related objects, locations, and actors, and therefore annotate a dataset of reference links that includes coreferences.

Online event detection
Discovering topics from documents, or from any kind of contextual source of information, was initiated by the topic detection and tracking (TDT) [3] project. The concept further moved on to integrate the time-varying property [12] by incorporating online data from social streams. Data gathering plays an important role in OED. Upgrading the input data streams without hampering their fundamental meaning is the core task of our work. The choice of data sources moved gradually from traditional media (news articles, T.V. broadcasts, and article stories) to blogs, e-mails, and micro-blogs [4], [13]-[16], due to the limited dataset availability on specific topics. Notable changes in data sources can be observed with the appearance of SNS [4], [5], [16], [17]. These sources steered drastic changes in event detection techniques due to their distinct characteristics. Twitter gained the highest popularity [1], [5] in the OED research field, as it is very convenient to obtain metadata based on certain requirements, and the data can easily be gathered through its application program interfaces (APIs). To design a well-shaped input stream, text cleaning plays a significant role, as raw data are highly unstructured, flooded with irrelevant texts, unimportant words, and spelling mistakes, because people replicate the spoken language in textual form. Hence, it is essential to filter out those words, or sets of words, to reduce further processing complexity. Becker et al. [4] detected potential events from Twitter data through incremental clustering by estimating similarity using the tf-idf term weighting metric. They performed basic cleaning operations, such as stop-word removal and stemming, on tokenized terms.
Hasan et al. [5] perform event clustering using two modules: first, a search module identifies the unique tweets; second, an event clustering module processes tweets that are not unique with respect to the existing clusters by estimating the similarity between them. Elimination of user names, mentions, stop-words, and web links is performed as pre-processing before applying the proposed event detection method. Nguyen et al. [1] first normalize tweets to obtain all potential terms, then monitor the occurrences of terms and generate signals for each of them, and finally extract features to estimate the similarity of tweets and create potential clusters. To normalize the tweets, all terms are converted to lower-case letters, extra spaces are eliminated, multiple repetitions of letters within words are removed, and user mentions, URLs, and hash symbols are discarded. Guille and Favre [18] produce a list of events, where each event is described by three elements: the main word, a set of weight-related words, and the time span and magnitude of the event. Statistical methods and external sources, excluding social media streams, are used to improve the accuracy. They also applied stop-word removal using a standard stop-word list for text processing.
Nikolaos et al. [15] discussed long-term and short-term event detection scenarios from the perspective of time. They redefined the inverse document frequency (IDF) score for time-varying scenarios with a fuzzy representation. Using the natural language toolkit (NLTK) in Python, their proposed technique counts the term frequencies, while synonyms and abbreviations are handled using the WordNet lexical database. It also removes URLs, slang words, common verbs, and nouns. The concept of high-utility pattern mining [17] has been adopted to detect event topics in this work. For pre-processing, all characters in the raw tweets are converted into lower-case letters, and tokenization, stop-word removal, handling of HTML tags, and removal of URLs and special symbols are performed. The method in [16] detects events through incremental clustering by checking similarity among documents. A new similarity metric is introduced that embeds a structural property along with the textual property. For pre-processing, basic natural language processing (NLP) steps, such as stemming of words and removal of stop-words, are performed. A latent event and category model is developed in [19] to discover events from Twitter data. NLP tools, such as a word tokenizer, part-of-speech (POS) tagger, stemming, named entity recognition (NER), and mapping of 'today' and 'tomorrow' to the published dates of tweets, are employed.
We observe that, though the existing OED works may follow different methodologies to identify or discover events, the initial pre-processing tasks are very similar for almost all of them. Moreover, it can also be seen that OED systems mostly use a term weighting metric for the clustering process, thereby establishing the importance of common words appearing in all the relevant documents so that those documents can be correctly clustered. Therefore, documents in which the prominent noun phrases are replaced with pronouns will suffer from reduced term weights, resulting in these documents being wrongly clustered. Most OED systems eventually suffer from this problem, as they do not resolve the pronouns; moreover, they remove these pronouns in their pre-processing phase, thereby eliminating any further possibility of pronoun resolution. A system that can resolve these pronouns before pre-processing may therefore increase the efficiency of any such OED system by providing more relevant words in the documents. In the field of linguistics, this technique is termed anaphora resolution (AR). So far, anaphora resolution has been applied to retrospective datasets and dialogue-based textual documents. The motivation of our work is to increase the data density of streaming data through pronominal resolution, which makes the process more challenging. The next section briefly describes the existing works on pronominal resolution found in the literature.

Identification of subjective words
In linguistics, both coreference resolution and anaphora resolution are related techniques in the context of identifying the antecedent, or subjective term, of a referring term such as a pronoun [20]. The computational approach to anaphora resolution has shifted from heuristic approaches to machine learning approaches due to the advancement of statistics in linguistics and the availability of publicly annotated corpora. These corpora, which are retrospective or historic [21] in nature, consist of news stories [22], technical manuals [23], or conversational text [24]-[26]. Next, we examine these areas.
In 1972, the first pronominal resolution approach was proposed by Winograd [27]. All the preceding noun phrases are considered as candidate antecedents and rated based on their syntactic positions. Another early work on the syntactic constraint-based approach is Hobbs's algorithm [28]. The algorithm searches for the possible antecedent of the pronoun on a parse tree, and was evaluated on a corpus of news articles. The concept was expanded and enhanced gradually by appending several other features. A multi-stage approach [29] considered multiple sources of knowledge, namely sentential syntax, case frame semantics, dialogue structure, and general world knowledge, to resolve anaphors. The identification of antecedents of third-person pronouns (he and she) and lexical anaphors is addressed by the Lappin and Leass approach [30]. A revised and updated version of this approach was developed by Kennedy and Boguraev [22], where the algorithm runs on the output of a POS tagger instead of a full, in-depth syntactic parse. The robust approach by Mitkov [20] identifies antecedent noun phrases within two preceding sentences of the anaphor; as input, a text is passed through a POS tagger, and antecedents are then identified through an antecedent scoring system.
The machine learning approach, first introduced by Connolly et al. [31], aims to resolve the referred word. To identify two coreferent noun phrases, a binary classifier, or mention-pair model, has been proposed. The classifier can be trained with any off-the-shelf learning algorithm, such as a support vector machine (SVM), maximum entropy, or a deep neural network. To maintain the transitivity property of coreference resolution, a separate clustering step (agglomerative or graph partitioning) is performed to select either the closest preceding antecedent (closest-first clustering) or the antecedent with the highest coreference likelihood (best-first clustering). The graph partitioning approach in [21], [32] is also employed for this purpose, where the nodes represent antecedents and the edges represent the possible weights between connected antecedents. The MUC6, MUC7, and ACE5 corpora are used to train the classifiers. Hence, the nature of the dataset is static or retrospective.
Turn structure is an organization of dialogue in which people speak alternately, one by one. This feature gained huge attention [26]; Stent and Bangalore [25] worked on a specific conversation-based dataset to improve relative performance. A detailed study of turn structures is given by [33], using a specific dataset of dialogues from a tutoring system. The procedure considered the location information of the candidate antecedent to analyze the corpus. Ritter et al. [34] used named entity recognition (NER) on tweets as a pre-processing task, which may lead to better performance, but the linking of pronouns with their corresponding subjective nouns is not addressed. Resolution of first-person pronouns (I, me, and mine) and second-person pronouns (you and yours) can be observed in [24] on a static dataset. To resolve pronouns with non-nominal antecedents, the Switchboard corpus is used in [35]; however, the authors do not highlight the need for using such features.

INFORMATION MELIORATION IN OED
During clustering of the documents, a lot of information is lost or is not clustered properly due to the absence of relevant nouns or other words in the document. All the main posts related to an event mostly carry similar information. New information is added as new documents or reply messages and is pushed into the relevant cluster. People use referring terms, such as pronouns or abbreviations, to refer to the main subject while writing responses or replies. Hence, the significance of nominal terms can be enhanced by establishing a link with their pronouns. Generally, pronouns are eliminated during preprocessing or data cleaning by a stop-word removal technique; sometimes abbreviations are also filtered out at this phase. Due to this phase, many of these cleaned documents are wrongly clustered. Finding the association between the referring words in the reply documents and their subjects in the main posts, before the removal of stop-words, may enhance the clusters' information density. So far, very few existing works in the field of event detection have given importance to this. Therefore, the main aim of this work is to augment the information density of event clusters by establishing the relation between reply posts and their main posts for data streaming from micro-blogging online social networking sites. For convenience, we use Twitter, as its datasets are available and can further be downloaded as streaming data, though the proposed system can be applied to any platform that allows posting replies.
Existing AR approaches are based on static data [23], [25], [26]. These algorithms process each document from the beginning to the end of the dataset in each iteration. In streaming data, however, this is not feasible. To apply these methods, the streaming documents would have to be buffered before processing, inducing a processing delay and a huge memory requirement (on average, 350,000 tweets are posted per minute). Therefore, an AR algorithm that is less dependent on historical documents, and hence on stored data, is needed. Moreover, rule-based AR approaches are best suited for the purpose, as they are less dependent on a large corpus, performing pronominal resolution on a limited and fixed number of antecedent statements.

Pronoun resolution to meliorate document density
The proposed approach is twofold. First, it prepares a text window of a limited number of reply messages for each primary document; second, it tries to resolve the pronouns of the reply messages within this text window. Social media text datasets, for example a dataset from Twitter, represent streaming texts that are knowledge-poor by nature; that is, the texts are not POS tagged. Here, we propose an antecedent tracking technique that can be applied to the set of all possible candidates found in the text window of a particular number of previous tweets. An aggregate score is assigned to all the possible candidate antecedents of a given pronoun after the execution of multiple scoring criteria. The candidate that achieves the maximum aggregate score is treated as the antecedent. POS tagging therefore plays a major role in this scoring system.
All the nouns in the text frame and all pronouns in the current post are identified first. The set of all nouns and pronouns, in other words, the referring terms, is listed. The list of all nouns is taken as the input of the scoring system after checking for animacy, gender, and number agreement. As we are considering streaming data, the number of previous documents to consider can prove to be very large; it is therefore important to limit the preceding reply window for antecedent tracking. We propose a window size W for the preceding related reply tweets, depending on the applicative nature of the implementation. The value of W may affect the antecedent tracking system in the following way. A large W will cause increased delay, but may provide more candidates for tracking. On the other hand, a smaller W may reduce the candidates, but will cause smaller delays, which may prove beneficial for real-time tracking. The study of W is out of scope for this article, and for the rest of the paper we assume its value to be 3. We assume a small value because, on social media, readers normally view only the most recent replies and therefore tend to write replies to these visible documents. Even if suitable tweets for the window are not found, intra-sentential referencing has to be performed before proceeding, to maintain the online environment. Here, we use the terms 'tweet', 'post', and 'sentence' interchangeably.
In the proposed method, we consider the '@' symbol, as users directly refer to other persons by tagging them with this special character. We also consider the hashtag ('#') symbol, as most '#'-tagged words are noun phrases and are generally used to highlight some significant matter in social media. The following method is performed on the text window for a particular document; let us denote the window of document d as Wd. First-person singular pronouns like 'I', 'me', and 'my' can be replaced with the 'username' of the post. Similarly, the first-person plural pronouns like 'we', 'our', and 'ours' can be resolved if the '@' symbol is present in the current tweet; these pronouns are resolved to the 'username' who posted the tweet together with all the usernames mentioned in the tweet using the '@' symbol. The pronoun 'it' usually refers to non-living entities and can be resolved according to the animacy agreement [23]. The second-person pronoun 'you', in case no '@' symbol is found in the current tweet, is resolved to the author of the tweet to which the current tweet is a reply. On the other hand, if '@' is present in the tweet, then the pronoun 'you' is resolved to all the usernames mentioned using the '@' symbol. The phrases marked with the symbol '#' are considered as candidate noun phrases to which pronouns can be resolved, and abbreviations are also considered as candidate noun phrases. Next, we discuss the design, formation, and usage of the text window Wd.
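The structural rules above can be sketched as a small helper. The function name, parameter names, and return convention below are hypothetical illustrations of the rules, not the paper's actual implementation:

```python
def resolve_structural(pronoun, author, parent_author, mentions):
    """Structural pronoun resolution from tweet metadata (hypothetical helper).

    `author` is the poster of the current tweet, `parent_author` the poster
    of the tweet being replied to, and `mentions` the set of usernames
    tagged with '@' in the current tweet.  Returns a list of candidate
    usernames, or None when the pronoun must go to contextual scoring.
    """
    p = pronoun.lower()
    if p in {"i", "me", "my", "mine"}:       # first-person singular -> author
        return [author]
    if p in {"we", "us", "our", "ours"}:     # author plus all @-mentions
        return [author] + sorted(mentions)
    if p in {"you", "your", "yours"}:
        # with no @-mention, 'you' points at the parent tweet's author
        return sorted(mentions) if mentions else [parent_author]
    return None                              # e.g. 'it', 'this': defer
```

For the worked example later in the paper, `resolve_structural('you', 'ct_turnip', 'honor_man', set())` resolves 'you' to the parent author 'honor_man'.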

Designing the reply window
As mentioned before, a sentential text window Wd of size W is proposed here. This window represents the combination of the main tweet and the corresponding recent reply tweets. The pre-processing steps, such as basic cleaning (removal of punctuation, conversion of all letters to lower case, removal of URLs, and removal of stop words), are performed after fetching the first tweet. Through POS tagging, the noun phrases in all the tweets within the frame are identified. The output of the POS tagger is used for animacy, gender, and number agreement checking; afterwards, the antecedent indicator scoring system is invoked to assign scores, as discussed before. Algorithm 1 provides the algorithmic description of the proposed approach. Algorithm 1 first creates a list of the most recent tweets that have parent-child relations, 'sentence[1]' being the latest and the target tweet. The functions 'preprocess(d)' and 'fetchParent(d)' perform preprocessing as discussed before and fetch the parent tweet of document d, respectively. It should be noted that the function 'fetchParent(d)' can be implemented in two ways. The simplest way is to fetch parent tweets online when required and run all the steps as the algorithm suggests; however, this results in a slower implementation, as it depends on the communication latency. The other way is to store the data locally, after resolution, for future usage; it can be reasoned that this approach incurs a large memory requirement. The procedure then creates a list of pronouns for the target tweet and a list of nouns for all the parent tweets. These two lists are then used to resolve the pronouns to their relevant nouns through pronounResolver(p, S, t), and the resolutions are applied to sentence[1] with the help of replaceDoc(). The details of the function pronounResolver(p, S, t) are discussed next.

Algorithm 1. Increasing information density of tweets by pronoun resolution
Result: Pronoun resolved tweets
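The flow that Algorithm 1 describes can be sketched in Python as follows; `preprocess`, `fetch_parent`, `pos_tag`, and `pronoun_resolver` are hypothetical stand-ins for the functions named in the text, and indexing starts at 0 here rather than at 1:

```python
W = 3  # reply-window size assumed in the paper

def algorithm1(doc, fetch_parent, preprocess, pos_tag, pronoun_resolver):
    """Sketch of Algorithm 1: resolve the pronouns of `doc` against the
    nouns of its ancestor tweets within a window of W parents.

    `fetch_parent(d)` returns the parent tweet of d, or None.
    `pos_tag(s)` returns (word, tag) pairs; `pronoun_resolver(p, nouns, t)`
    returns an antecedent noun for pronoun p, or None."""
    # sentences[0] is the target tweet; its ancestors follow, newest first
    sentences = [preprocess(doc)]
    parent = fetch_parent(doc)
    while parent is not None and len(sentences) <= W:
        sentences.append(preprocess(parent))
        parent = fetch_parent(parent)

    # candidate nouns come from the ancestors, pronouns from the target
    nouns = [w for s in sentences[1:] for w, tag in pos_tag(s)
             if tag.startswith("NN")]
    pronouns = [w for w, tag in pos_tag(sentences[0]) if tag == "PRP"]

    resolved = sentences[0]
    for p in pronouns:
        antecedent = pronoun_resolver(p, nouns, resolved)
        if antecedent is not None:
            resolved = resolved.replace(p, antecedent)  # cf. replaceDoc()
    return resolved
```

Any POS tagger (e.g. NLTK's) can be plugged in for `pos_tag`; the resolver callback is where the structural and contextual rules of the next subsection live.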

Antecedent scoring system
The core of the proposed approach is an antecedent scoring system that searches for a suitable noun phrase for a given pronoun appearing in the candidate reply tweet. Though there are a few pronoun resolution systems, such as [23], [33]-[35], that resolve pronouns through contextual properties, and even fewer articles that use platform-specific structural properties, such as [36], to the best of our knowledge there is no proposal in the literature to date that uses both structural and contextual properties for pronoun resolution on streaming data from social media. Therefore, in our proposal, we employ the structural properties that a social media platform provides, along with the contextual properties of the related sentences, similar to [23], to achieve our aim. Algorithm 2 explains the procedure of the method pronounResolver(p, S, t), where p, S, and t represent the pronoun to be resolved, the set of all candidate noun phrases for p, and the tweet in which the resolution is to be performed, respectively. The algorithm first employs the structural properties of Twitter to perform possible pronoun resolution, as can be seen up to line number 17. Twitter provides much structural information that is used in this algorithm, such as the author name of the tweet (authorName(t)), the parent tweet id in case the document in question is a reply (fetchParent(t)), and the user mentions using '@' (unameSet(t), which returns the set of all user mentions). Apart from these, the method remMismatchGenderNumber(S, p) removes the noun phrases in S that do not satisfy gender and number agreement with p, and tweetFormat(t, {'it', verb}) returns FALSE if the pattern {'it', verb} is not present in the tweet t.
Finally, in case the structural resolution fails to find a match for p, the algorithm proceeds to contextual properties through the method maxSimilarity(p, S). Empirically identified subjective antecedent indicators are used in this method to assign scores to the candidate nouns in order to identify possible noun phrases. The assigned scores relate to salience, structural matches, referential distance, and preference of terms. As previously mentioned, each noun phrase in S is a pair {k, v} indicating the noun phrase and its total score, respectively. The scores described next are added to v for each k for a given p. The factors are assigned as follows. A 'definite noun phrase', replaced with a 'demonstrative' or 'possessive' pronoun or with another 'definite noun phrase', is assigned the score 0. Candidates representing 'given information', or the 'theme', are good antecedent candidates and are assigned a score of 1. A score of 1 is assigned if the candidate matches a specific list of verbs. Noun phrases with reiterations are a good choice for antecedent; the score depends on the number of repetitions within the same text window: 2 if the occurrence repeats at least twice within the window, 1 if it repeats once, and 0 otherwise. Phrases that are preceded and succeeded by a prepositional phrase are penalized with -1; non-prepositional phrases are preferred and secure 1. A candidate that carries a collocation pattern similar to that of the pronoun is preferred and secures 2; others score 0. Consider noun phrases (N.P.), pronoun phrases (PNP), and verbs (V1, V2, and V3) following the sentence structure "... (PNP) V1 N.P... conj (PNP) V2 it (conj (PNP) V3 it)", where 'and', 'or', 'before', and 'after' are some examples of possible conjunctions. The N.P. succeeding V1 is a strong candidate for the antecedent of the 'it' following V2, and is therefore selected as the preferred term and awarded 2; otherwise, 0 is assigned. A referential distance score is assigned to each candidate as follows: 2 if it belongs to the same sentence/line as the pronoun, 1 if it belongs to the preceding line, and 0 if it resides in the line before that. N.P.s constituting principal terms are better candidates as antecedents and thereby receive an additional score of 1.
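The indicator scores listed above can be aggregated as follows. The feature flags are a hypothetical encoding of the criteria, not the authors' exact code; the tie-break prefers the nearer candidate, as in the worked example that follows:

```python
def score_candidate(feat):
    """Aggregate antecedent-indicator score for one candidate noun phrase.
    `feat` is a hypothetical dict encoding the criteria from the text."""
    score = 0
    if feat.get("given_theme"):       score += 1   # given information / theme
    if feat.get("verb_list_match"):   score += 1   # matches the verb list
    reps = feat.get("repetitions", 0)              # reiteration in the window
    score += 2 if reps >= 2 else (1 if reps == 1 else 0)
    score += -1 if feat.get("prepositional") else 1  # prepositional penalty
    if feat.get("collocation_match"): score += 2   # shared collocation pattern
    if feat.get("np_after_v1"):       score += 2   # "... V1 N.P ... V2 it" pattern
    dist = feat.get("sentence_distance", 2)        # referential distance
    score += {0: 2, 1: 1}.get(dist, 0)
    return score

def max_similarity(candidates):
    """Pick the best-scoring candidate noun phrase; on a tie, prefer the
    candidate with the smaller referential distance."""
    return min(candidates.items(),
               key=lambda kv: (-score_candidate(kv[1]),
                               kv[1].get("sentence_distance", 2)))[0]
```

With two candidates scoring equally, `max_similarity` returns the one closer to the pronoun, mirroring how 'letter' wins over 'DefendDACA' in the example.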
A possible example of the algorithms presented here is as follows. Let us assume t is "a month ago, you and I co-led a letter with over 100 colleagues of mine and sent this to another 45", posted by the user 'ct_turnip' in reply to a tweet posted by 'honor_man' that states "Together we will fight #DefendDACA". Here, the list of pronouns to be resolved is {'you', 'I', 'mine', 'this'}. The first pronoun in the list, 'you', can be resolved to the parent tweet's author, 'honor_man', since the set of user mentions U is empty. The pronouns 'I' and 'mine' are both first-person singular pronouns and are therefore resolved with the author's name, 'ct_turnip'. To resolve the pronoun 'this', a list of candidate nouns is prepared as {'honor_man', 'ct_turnip', 'DefendDACA', 'month', 'letter', 'colleague'}. The initial score is set to 0 for all the candidates. The method maxSimilarity(p, S) then assigns scores to each of the noun phrases, as discussed previously: {'honor_man': 0, 'ct_turnip': 2, 'DefendDACA': 3, 'month': 2, 'letter': 3, 'colleague': 1}. We can observe that 'DefendDACA' and 'letter' receive the same score. In such a case, the candidate with the nearest reference is chosen as the antecedent. Therefore, the pronoun 'this' is replaced by 'letter', and the final resolved tweet is "a month ago, honor_man and ct_turnip co-led a letter with over 100 colleagues of ct_turnip and sent letter to another 45". It is to be noted that we used one particular pronoun resolution method in our proposed algorithm. Several similar methods exist in the literature that may improve the performance of the desired outcome. However, these algorithms are not explored here and are left as future scope of our work.

RESULTS AND DISCUSSION
In this section, we present our experimental results, analyses, and discussions. Before that, we describe the existing OED algorithms used in our experiments, the dataset used, and the experimental setup of the evaluation systems.

Online event detection algorithm
For evaluation purposes, we consider two major document-pivot OED approaches in this area, proposed by Becker et al. [4] and Hasan et al. [5], as document-pivot techniques focus on the clustering of documents by satisfying certain similarities between the documents. According to Becker et al. [4], an event, denoted by e, is a real-world occurrence associated with a period te and a time-ordered streamed Twitter message set Me, where all the messages in this set are posted in the period te. To achieve event clustering over tweet streams in real time, the approach uses an incremental online clustering algorithm. A threshold value for the similarity score is set empirically; an incoming tweet is inserted into the existing cluster with the highest similarity score if that score exceeds the threshold, otherwise a new cluster is formed. Each of these clusters is a potential candidate portraying a possible new event. The authors further proposed an SVM-based classification technique to classify these clusters into real-world events and non-events, which is out of scope for our article.
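The incremental step can be sketched as follows. The similarity function is left abstract (Becker et al. use tf-idf-based similarity between a tweet and a cluster); the function and parameter names are illustrative, not taken from the original paper:

```python
def incremental_cluster(stream, similarity, threshold):
    """Threshold-based incremental online clustering in the style of
    Becker et al.  `similarity(doc, cluster)` is a hypothetical score in
    [0, 1] between a document and the documents of a cluster."""
    clusters = []
    for doc in stream:
        best, best_sim = None, 0.0
        for c in clusters:
            s = similarity(doc, c)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best.append(doc)            # join the most similar cluster
        else:
            clusters.append([doc])      # otherwise open a new event cluster
    return clusters
```

Each resulting cluster is then a candidate event, to be filtered further (e.g. by the SVM-based event/non-event classifier mentioned above).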
Hasan et al. proposed a new approach, TwitterNews+ [5]. It is an incremental clustering algorithm with comparatively low computational complexity, as it discards old clusters after a time threshold to make room for new clusters. TwitterNews+ consists of two main components: the 'search module' and the 'event cluster module'. From the set of latest tweets maintained by the system, the search module provides fast retrieval of similar tweets and a binary decision on the uniqueness of an input tweet. A tweet tagged as 'not unique' by the search module is handed over to the event cluster module, which searches for a candidate cluster to which the tweet can be assigned. This module has a de-fragmentation submodule to merge small fragmented clusters. To extract newsworthy events from the candidate event clusters, the system uses a set of filters and a word-level longest common subsequence. A notable pre-processing task performed by TwitterNews+ is the removal of spam-phrase tweets (such as 'free access', 'click here').
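The cluster-expiry idea that keeps the computational cost of TwitterNews+ low can be sketched as follows; the time window, field names, and data structure are illustrative, not the authors' actual implementation.

```python
# Sketch of time-based cluster expiry: clusters that have not received a
# tweet within `max_age` seconds are discarded, which bounds the number
# of clusters an incoming tweet must be compared against.

def expire_clusters(clusters, now, max_age=3600):
    """clusters: list of dicts with 'tweets' and 'last_update' (epoch secs).
    Returns only the clusters updated within the last `max_age` seconds."""
    return [c for c in clusters if now - c["last_update"] <= max_age]

clusters = [
    {"tweets": ["old news story"], "last_update": 1000},
    {"tweets": ["breaking: flood warning"], "last_update": 9500},
]
alive = expire_clusters(clusters, now=10000, max_age=3600)
print(len(alive))
```

The stale cluster (last updated 9000 seconds ago) is dropped, leaving only the recently active one for further comparisons.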

Experimental set-up
For the experiments, an open-source dataset from the George Washington University website [37] is used here after hydration. The dataset consists of tweets posted between August 2013 and January 2019. As we consider only English tweets, all non-English tweets are filtered out before use. Moreover, due to incomplete information in the older tweets, we remove all documents posted before 2017 to maintain consistency among the data points considered. For the evaluation of our work, a manually annotated ground truth for the clusters is prepared. Given that the proposed work targets streaming social media, the experimental set-up imitates this property by treating the database as time-series data through the timestamp associated with each hydrated tweet, and by not considering any document with a higher timestamp than the document on which resolution is being performed.
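The streaming simulation described above, in which tweets are processed in timestamp order and resolution for a tweet may only consult earlier documents, can be sketched as follows (the tweet texts and timestamps are invented):

```python
# Sketch of the streaming simulation: tweets are processed in timestamp
# order, and for each tweet only documents with a strictly earlier
# timestamp are visible to the resolution step.

def simulate_stream(tweets):
    """tweets: list of (timestamp, text) pairs, possibly unordered.
    Yields (timestamp, text, visible_history) in timestamp order."""
    ordered = sorted(tweets, key=lambda t: t[0])
    for i, (ts, text) in enumerate(ordered):
        history = [h[1] for h in ordered[:i]]  # only earlier documents
        yield ts, text, history

tweets = [(3, "and sent this"), (1, "we co-led a letter"),
          (2, "over 100 signed")]
visible = {ts: hist for ts, _, hist in simulate_stream(tweets)}
print(visible[3])
```

The earliest tweet sees an empty history, while the latest tweet sees both earlier documents, mirroring the constraint used in our set-up.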
The most commonly used evaluation metrics, namely precision, recall, and F-measure, are used here to evaluate the efficiency of the implemented work. The evaluation considers the common cluster membership of object pairs in the clustering; this pairwise membership is used to calculate precision and recall. To run the experiments, we run both existing procedures, as explained in section 4.1, separately. For each of them, we first run the procedure on the previously mentioned dataset; then the same algorithm is run again on the dataset after first resolving the pronouns in each document through the algorithm explained in the previous section. Each outcome is compared with the previously prepared ground truth. Finally, the proposed metrics are measured based on the true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn). Apart from these three metrics, as we intend to demonstrate that the number of properly clustered documents increases, we propose a new comparison metric, the average number of true positives per event cluster, which demonstrates the average increase or decrease in the document density of the event clusters.
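The pairwise evaluation described above can be sketched as follows: a pair of documents counts as a true positive if the two are clustered together both in the system output and in the ground truth. The document ids and cluster labels below are invented for illustration.

```python
from itertools import combinations

# Pairwise clustering evaluation: precision, recall, and F-measure are
# computed over document pairs, based on whether each pair shares a
# cluster in the predicted clustering and in the ground truth.

def pairwise_scores(predicted, truth):
    """predicted, truth: dicts mapping document id -> cluster label."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1          # clustered together in both
        elif same_pred:
            fp += 1          # together in prediction only
        elif same_true:
            fn += 1          # together in ground truth only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

pred = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
true = {"d1": "e1", "d2": "e1", "d3": "e1", "d4": "e2"}
p, r, f = pairwise_scores(pred, true)
print(round(p, 2), round(r, 2))
```

Here the pair (d1, d2) is a true positive, (d3, d4) a false positive, and (d1, d3) and (d2, d3) false negatives, giving a precision of 0.5 and a recall of one third.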

Evaluation and result analysis
We begin the evaluation by running the procedure proposed by [4] on the given dataset (named base data) and on the same dataset with our proposed pronoun resolution applied (named P.R. data). Similarly, we also examine the TwitterNews+ algorithm with these two corpora and present the results through the parameters discussed above. The precision, recall, and F-measure values for the mentioned algorithms are shown in Table 1 and Table 2, respectively.
From Table 1 and Table 2, we can see that the approaches proposed in both articles achieve better precision values with the pronoun-resolved corpus. Where we observe a 2.79% increase in precision for the approach by [4], a 3.52% increase can be seen for TwitterNews+ as proposed by [5]. The primary reason behind this improvement lies in their choice of term-weighting metric. Both papers, though they use different clustering techniques, employ the term-weighting metric tf-idf, where weights are assigned based on the terms common to two documents. Considering that many tweets in a corpus are replies to earlier tweets, such replies are often not clustered together for lack of common terms. This happens because the writer of a reply tweet may replace words already mentioned in the previous tweet with pronouns. Due to this use of pronouns in place of nouns, the similarity of the two documents may drop greatly, thereby affecting the clustering process. By resolving the pronouns and replacing them in place through the proposed method, the performance of the existing methods improves, as the term-weighting metrics perform better with the pronoun-resolved data.
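The effect described above can be illustrated with a toy similarity computation; raw term-frequency cosine similarity stands in for tf-idf here, and the parent and reply tweets are invented.

```python
import math
from collections import Counter

# Toy illustration: replacing a pronoun with its antecedent raises the
# similarity between a tweet and its reply, making it more likely that
# the two are clustered together.

def cosine(a, b):
    """Cosine similarity over raw term frequencies."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

parent = "we co-led a letter with 100 colleagues"
reply_raw = "and sent this today"          # pronoun hides the key term
reply_resolved = "and sent letter today"   # after pronoun resolution

low = cosine(parent, reply_raw)
high = cosine(parent, reply_resolved)
print(low < high)
```

Before resolution the reply shares no terms with the parent, so the similarity is zero; after replacing 'this' with 'letter', the shared term produces a positive similarity.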

For the same reasons as discussed above, a similar trend can also be seen in the recall and F-measure values for both algorithms. One important point to note here is that, while preparing the ground truth for this corpus, if a tweet is marked as part of a certain event, all its reply tweets are also categorized as part of the same event. However, we observed that most reply tweets do not necessarily contribute to the discussion and contain neither the important words nor pronouns pointing to those words. This leads to a large decrease in the recall of the clustering methods, as the term-weighting metrics depend solely on the re-use of common words. The results shown in Figure 1 focus on the primary aim of our work, which is to increase the information density in the clusters. We can see from Figure 1 that, for both protocols, the number of true positives has increased manyfold on average for the identified event clusters. This is due to the same reason iterated before: as many of the pronouns are resolved to their suitable noun phrases, the term-weighting metrics can cluster the reply tweets more efficiently by assigning a higher score to them. In the results shown above, we considered a fixed value of Wd for the simulation. In the future, we will consider the effects of different values of Wd and explore the possibility of an adaptive Wd window size for this work.

CONCLUSIONS AND FUTURE WORKS
In this paper, we presented a pronoun resolution algorithm for streaming data from social media platforms that can be applied before various OED algorithms. Our results show that, by applying our algorithm before the OED procedures, these procedures achieve improved efficiency in clustering the relevant documents. Moreover, to resolve the pronouns, our method needs to fetch only a fixed number of previous documents in the reply chain of posts, rather than all previous documents. Though we have presented our results using Twitter data, the method can be applied to data from any social media platform that has a reply feature. The selection of the document window size remains future work, as does exploring other pronoun resolution algorithms that may further enhance the performance of this approach.

Figure 1. Increase in the average number of true positives for event clusters