Abstract
Microblogging services like Twitter enable communication at a massive scale. Twitter recently reported¹ that 248 million users access the platform every month and create around 500 million posts every day in more than 35 languages. This tremendous growth offers an opportunity to mine useful information shared across social media. However, Twitter limits posts (also known as “tweets”) to 140 characters. This limit results in heavy use of emoticons, abbreviations, and misspellings, and has led to various linguistic “innovations” that render traditional text analysis techniques less effective.
Another interesting aspect of tweets is the use of semantico-syntactical constructs called
“hashtags”. Hashtags are “#”-prefixed keywords that people use to organise the meaning of
their tweets. Hashtags also enable classification of tweets, since posts using the same or similar hashtags
are expected to be semantically related to each other. The challenge posed by hashtags is that
most are not single “#”-prefixed keywords, but a “#” symbol followed by a concatenation of words or phrases that are not space-delimited. For example, consider the hashtag “#NSAvsSnowden”. This hashtag is essentially “NSA vs Snowden”, which is not a single keyword but a concatenation of several words. In this thesis, we discuss and compare various approaches to “segmenting” a hashtag into meaningful words. Our task also extends beyond segmentation: we present a unified framework that additionally performs “entity linking” on the constituent entities of a hashtag.
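To make the segmentation task concrete, the following is a minimal sketch of one simple baseline: a dictionary-based dynamic programme that splits a hashtag into the fewest known words. It is not necessarily one of the approaches compared in this thesis; the function name `segment_hashtag` and the toy `VOCABULARY` are assumptions made purely for illustration.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into vocabulary words, preferring the fewest words.

    Returns a list of words (original casing preserved), or None when the
    hashtag cannot be fully covered by words from the vocabulary.
    """
    text = hashtag.lstrip("#")
    lowered = text.lower()
    n = len(lowered)

    # best[i] is the shortest word list covering text[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if lowered[start:end] in vocabulary:
                candidate = best[start] + [text[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy vocabulary (hypothetical); a real system would use a large lexicon.
VOCABULARY = {"nsa", "vs", "snowden", "snow", "den"}

print(segment_hashtag("#NSAvsSnowden", VOCABULARY))
```

Preferring the fewest words keeps “Snowden” intact rather than splitting it into “Snow” and “den”. A practical segmenter would more likely rank candidate splits with word or n-gram frequencies instead of word count alone.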
Entity linking is an established IR task in which the goal is to extract latent semantics from plain text
by linking the text to a knowledge base (KB) such as Wikipedia. Consider, for example, the
text “Snowden reveals classified information from NSA”. We first need to identify the entities in
this text, then “disambiguate” them, and finally establish links between those entities
and a KB, so that additional contextual information is available about each
entity. This approach has been found to be instrumental in teaching machines the meaning of text
that is otherwise meant for human consumption.
Hence, after performing entity linking on the segmented hashtag “NSA vs Snowden”, we would
have enriched the text with additional semantic information by establishing links between “Snowden”
and the corresponding Wikipedia page “Edward Snowden”, and between “NSA” and “National Security
Agency”. In principle, “NSA” could also refer to “National Sports Academy” or “National Security
Act”. This is exactly where “disambiguation”, which uses contextual information to resolve
a mention to the intended entity, becomes important.
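The role of context in disambiguation can be illustrated with a small sketch that picks, for an ambiguous mention, the candidate whose descriptive keywords overlap most with the surrounding text. This is only an illustrative heuristic under stated assumptions, not the disambiguation method of this thesis; the `CANDIDATES` table and its keyword sets are hypothetical stand-ins for information that would come from a KB such as Wikipedia.

```python
# Hypothetical candidate table: mention -> {KB title: descriptive keywords}.
CANDIDATES = {
    "NSA": {
        "National Security Agency": {"intelligence", "surveillance", "snowden", "classified"},
        "National Sports Academy": {"sports", "athletes", "training", "school"},
        "National Security Act": {"law", "congress", "legislation", "1947"},
    }
}


def disambiguate(mention, context_words, candidates=CANDIDATES):
    """Return the KB title whose keywords share the most words with the context.

    Returns None when the mention has no known candidates.
    """
    context = {word.lower() for word in context_words}
    options = candidates.get(mention, {})
    if not options:
        return None
    # Score each candidate by the size of its keyword overlap with the context.
    return max(options, key=lambda title: len(options[title] & context))


tweet = "Snowden reveals classified information from NSA"
print(disambiguate("NSA", tweet.split()))
```

With this context, the words “Snowden” and “classified” favour “National Security Agency” over the other two candidates, which share no words with the text.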
Since hashtags are human-curated labels associated with tweets, our premise is that segmenting and
linking the entities present within hashtags can help in better understanding and extracting
information shared across social media. Traditionally, most IR tasks have either treated
hashtags as single words or ignored them for all practical purposes. We demonstrate how
the extraction of semantics from tweets improves when our system makes additional semantic information
available by segmenting and entity-linking hashtags. We demonstrate this through
experiments on the NEEL Challenge Dataset and on a human-annotated subset of the Stanford Sentiment
Analysis Dataset, which has also been made public to ease future research in this area. We
achieve a P@1 score of 0.914 on the NEEL Dataset and 0.873 on the manually annotated Stanford Sentiment
Analysis Dataset for hashtag segmentation and linking.
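For readers unfamiliar with the metric, P@1 (precision at rank 1) is commonly computed as the fraction of items whose top-ranked prediction matches the gold annotation. The sketch below assumes that standard definition; the function name `precision_at_1` and the toy data are illustrative and do not reproduce the evaluation setup or results reported above.

```python
def precision_at_1(ranked_predictions, gold):
    """Fraction of items whose first-ranked prediction equals the gold label.

    ranked_predictions: dict mapping an item to its ranked list of candidates.
    gold: dict mapping the same items to the correct answer.
    """
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for item, ranked in ranked_predictions.items()
        if ranked and ranked[0] == gold.get(item)
    )
    return hits / len(ranked_predictions)


# Toy example (hypothetical data): three of four top-ranked answers are correct.
predictions = {
    "#NSAvsSnowden": ["NSA vs Snowden", "NSA vs Snow den"],
    "#RockConcerts": ["Rock Concerts"],
    "#nowplaying": ["now playing", "nowplaying"],
    "#therapist": ["the rapist", "therapist"],
}
gold = {
    "#NSAvsSnowden": "NSA vs Snowden",
    "#RockConcerts": "Rock Concerts",
    "#nowplaying": "now playing",
    "#therapist": "therapist",
}
print(precision_at_1(predictions, gold))
```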
We also showcase how our approach leads to improvements in the tasks of “Semantic Microblog
retrieval” and “Semantic Hashtag retrieval”. Microblog retrieval refers to retrieving a ranked list
of microposts given a query Q. Hashtag retrieval, on the other hand, is a relatively newer IR task:
retrieving a ranked list of the top-k hashtags relevant to a user’s query Q. For a user interested in,
for instance, “Rock concerts”, it would be very helpful to be suggested a list of
hashtags that are commonly used in relation to “Rock concerts”. By
tracking these hashtags, the user can gain information about rock concerts via the posted tweets. However,
it is not possible for the user to manually figure out all the hashtags used across Twitter that are relevant
to their interest. In this thesis, we also address this problem. In order to solve these two retrieval
problems, we propose and discuss a virtual docume