Abstract
In recent years, microblogging services like Twitter have become a major tool for sharing events, expressing opinions and communicating with friends. Several thousand microblogs are posted every second, describing ongoing events around the world. Because of this rise in the popularity and size of social media, there is a growing need for systems that can extract useful information from it. The limited message length (140 characters), the special vocabulary with incomplete sentences and emoticons, and the heavy use of abbreviations and misspellings make traditional text analysis methods less effective on Twitter. Entity Linking is the task of extracting semantics from text by linking the entities present within the text to a knowledge base. For example, given the text “India is my country”, we need to extract important mentions like ‘India’ and link them to their corresponding Wikipedia pages. Entity Linking is an important task: it not only provides useful meta-information about the linked entities but also helps in extracting the relations among the various entities present within the text. This meta-information and these relations are in turn helpful for extracting other information, such as document topics. Since linking also removes the problems of synonymy and polysemy, it makes the text easier for computers to analyse. Entity linking is therefore very helpful in making computers understand text that is usually meant for human consumption.
In this thesis, we devise a method for efficient entity linking and design various features that help in detecting and linking the entities present within a tweet. First, the mentions present within the tweet are identified. We experimented with several approaches to mention detection, including a heuristics-based approach, a probabilistic approach and a sequence-labelling approach, and finally propose a classification-based approach. Along with being identified, the detected mentions are also ranked by their predicted informativeness, and the user can select mentions by setting a threshold on the ranks according to the application. All the approaches are evaluated on several datasets: the NEEL Challenge Dataset, the User Dataset, the TagMe Dataset and the IITB Dataset. The classification-based method performs best on all datasets, with an F-measure of 0.59 on the NEEL Dataset.
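The abstract does not specify the feature set or classifier, so the sketch below only mirrors the overall shape of the classification-and-ranking step: candidate mentions are scored, ranked by a stand-in for predicted informativeness, and kept if they clear an application-chosen threshold. The features, weights and the surface-form list are invented placeholders, not the thesis model.

```python
# Sketch of classification-based mention detection with ranking.
# Candidate n-grams from a tweet are scored, ranked, and filtered by a
# user-chosen threshold. Features and weights are illustrative only.

KNOWN_SURFACE_FORMS = {"india", "taj mahal"}   # hypothetical anchor dictionary

def candidate_ngrams(tokens, max_len=3):
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            yield " ".join(tokens[i:j])

def score(candidate):
    """Toy scorer standing in for a trained classifier's confidence."""
    s = 0.0
    if candidate.lower() in KNOWN_SURFACE_FORMS:
        s += 0.6                      # matches a known surface form
    if candidate[:1].isupper():
        s += 0.3                      # capitalised in the tweet
    return s

def detect_mentions(tweet, threshold=0.5):
    tokens = tweet.split()
    ranked = sorted(((score(c), c) for c in candidate_ngrams(tokens)),
                    reverse=True)
    return [(c, s) for s, c in ranked if s >= threshold]

print(detect_mentions("Visiting the Taj Mahal in India today"))
```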
The detected mentions are then disambiguated on the basis of context. Disambiguation means selecting the best entity from the set of candidate entities for a given mention. Because of the limited number of characters, the context alone is usually not sufficient to disambiguate all the entities present within a tweet. We therefore also explore users’ interests and recent-tweet information to improve disambiguation on short and noisy tweets. The information in a user’s previous tweets is modeled to build the user’s topics of interest; these interests, along with the context, are then used to disambiguate the entities in the user’s new tweets. We also explore trending-topic information (what is trending on Twitter at a particular time) to improve entity disambiguation. The user-based approach is tested on a user dataset collected from around 200 tweets of 20 users, with two context modeling systems. The recent-tweet-based approach is tested on the standard NEEL Dataset and on tweets from recent events collected in June 2014. Experiments show that the extra information from a user’s previous history and from recent tweets helps in improving the performance of entity disambiguation systems.
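The abstract does not give the exact scoring function, so the sketch below is only meant to convey the idea of combining local tweet context with a user-interest profile built from previous tweets. The candidate descriptions, the interest profile and the mixing weight `alpha` are illustrative assumptions, not the thesis configuration.

```python
# Sketch: disambiguate a mention by combining tweet-context similarity
# with a user-interest profile built from the user's previous tweets.
# Candidate descriptions, the interest profile and `alpha` are assumptions.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def overlap(a, b):
    return sum((a & b).values()) / (1 + sum(a.values()))

def disambiguate(mention, candidates, tweet, user_interests, alpha=0.7):
    """Pick the candidate entity maximising a mix of context and interest scores."""
    tweet_bow = bow(tweet)
    interest_bow = Counter(user_interests)
    best = max(
        candidates.items(),
        key=lambda kv: alpha * overlap(bow(kv[1]), tweet_bow)
        + (1 - alpha) * overlap(bow(kv[1]), interest_bow),
    )
    return best[0]

candidates = {            # candidate KB entries for the mention "Python"
    "Python_(programming_language)": "programming language software code",
    "Python_(snake)": "snake reptile species",
}
print(disambiguate("Python", candidates,
                   tweet="Loving the new Python release for my code",
                   user_interests=["software", "programming", "code"]))
```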
Hashtags are semantico-syntactic constructs used across various social networking and microblogging platforms that enable users to start a topic-specific discussion or classify a post into a desired category. Segmenting and linking the entities present within hashtags could therefore help in better understanding and extraction of the information shared on social media. However, because hashtags lack space delimiters, extracting semantics from them is a non-trivial task. Most current state-of-the-art social media analytics systems, such as Sentiment Analysis and Entity Linking systems, tend to either ignore hashtags or treat them as a single word. We present a context-aware approach that segments a hashtag and links its entities to knowledge base (KB) entries based on the context within the tweet. Our approach segments and links the entities in a hashtag such that the coherence between the hashtag semantics and the tweet is maximized. Since segmentation and linking are jointly modeled, the system chooses the segmentation that maximizes the relatedness between the tweet and the hashtag; when multiple segmentations are possible, the best one is picked based on the context.
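A minimal sketch of this idea is shown below: enumerate candidate segmentations of a hashtag and keep the one whose words are most related to the rest of the tweet. For brevity the joint linking step is reduced to simple word-in-context matching; the vocabulary, example hashtag and scoring are placeholders for the thesis's joint segmentation-and-linking model.

```python
# Sketch of context-aware hashtag segmentation: enumerate segmentations
# over a small vocabulary and keep the one whose words best match the
# tweet context. Vocabulary and scoring are placeholders.

VOCAB = {"new", "york", "fashion", "week", "newyork", "we", "ek"}  # toy lexicon

def segmentations(s, prefix=()):
    """Yield all ways of splitting s into vocabulary words."""
    if not s:
        yield prefix
    for i in range(1, len(s) + 1):
        if s[:i] in VOCAB:
            yield from segmentations(s[i:], prefix + (s[:i],))

def best_segmentation(hashtag, tweet):
    context = set(tweet.lower().split())
    return max(
        segmentations(hashtag.lower()),
        key=lambda seg: sum(w in context for w in seg) - 0.01 * len(seg),
        default=None,
    )

print(best_segmentation("NewYorkFashionWeek",
                        "Excited for fashion week in New York #NewYorkFashionWeek"))
# ('new', 'york', 'fashion', 'week')
```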
The approach is tested on two different datasets: one created synthetically from the NEEL Dataset and one created manually. We achieved a P@1 score of 0.914 on the NEEL Dataset and