Abstract
With the ever-increasing influence of social media in our lives, it is important to automatically make
sense of the enormous amounts of social media data generated each day and leverage it for different
applications. A politician trying to gauge her chances of winning the election, the government reaching
out to understand the grievances of the citizens regarding various administrative issues, a company
attempting to promote its products and perceive its market reach - all of them have one thing in common,
their destination: social media. Even a few years ago, a restaurant interested in customer
feedback would have to run manual surveys, requesting customers to fill out lengthy feedback forms.
Not only did this approach have limited reach, it also had the additional problem of customers not being
responsive. Now, with the advent of social media, this process has become much less cumbersome,
with people reporting their experiences openly and unguardedly, thereby reducing the cost incurred
by companies manyfold. However, it is challenging to mine information from the huge amount
of user-generated, unstructured social media data. Understanding this data primarily entails identifying,
understanding and analysing the topics discussed on social media along with the opinions expressed towards
those topics. In this thesis, we focus on this problem, by first treating the two tasks of topic and
sentiment identification independently, before eventually presenting a joint model.
Topic identification in social media posts has been widely studied in the recent past. In contrast,
we direct our attention towards capturing the topics in the conversations that arise out of interactions
between users on social media. Specifically, given a conversation on Twitter, we aim to automatically
recommend a relevant tweet, treating the conversation as a tree. Existing solutions to this problem exploit the similarity of a candidate tweet to a single tweet, or to past tweets in a user-pair conversation. In this work, we generalize the problem setting to recommend a tweet considering the context from an entire conversation tree, which often includes tweets from multiple users. While this setting is more natural, it brings in additional challenges: (1) how to choose an anchor tweet node from the conversation tree for which a new tweet can be recommended as a reply, and (2) how to choose the tweet to recommend.
We learn regression models with novel features to address both challenges, and use them to perform
extractive response recommendation. The first regression model predicts the time required by a tweet
node to get its first child node, while the second predicts the number of retweets received by a tweet, as a measure of its popularity and acceptability, and hence quality. Experiments with millions of tweets show that the proposed recommendation method is more accurate than state-of-the-art approaches with respect to ground-truth labels. Due to the lack of manually annotated data, we used proxy signals
to infer labels and treated them as ground truth.
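The two-model recommendation procedure described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the names `TweetNode`, `recommend_reply`, `toy_delay` and `toy_retweets` are hypothetical, the two predictors are hand-written stand-ins for the learned regression models, and the way the two predictions are combined (pick the node expected to receive a reply soonest, then pick the candidate with the highest predicted retweet count) is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class TweetNode:
    """One tweet in a conversation tree (fields are illustrative)."""
    tweet_id: str
    text: str
    children: List["TweetNode"] = field(default_factory=list)


def iter_nodes(root: TweetNode) -> Iterator[TweetNode]:
    """Yield every node of the conversation tree, depth first."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def recommend_reply(
    root: TweetNode,
    candidates: List[str],
    predict_reply_delay: Callable[[TweetNode], float],
    predict_retweets: Callable[[TweetNode, str], float],
) -> Tuple[TweetNode, str]:
    """Pick an anchor node, then pick a candidate tweet to attach to it.

    Assumed combination rule: the anchor is the node with the smallest
    predicted time to its first reply; the recommended tweet is the
    candidate with the largest predicted retweet count for that anchor.
    """
    anchor = min(iter_nodes(root), key=predict_reply_delay)
    best = max(candidates, key=lambda c: predict_retweets(anchor, c))
    return anchor, best


# Toy conversation tree and candidate pool (made-up data).
root = TweetNode(
    "t1",
    "Which phone has the best camera this year?",
    children=[
        TweetNode("t2", "The new Pixel camera is great",
                  children=[TweetNode("t3", "Agreed")]),
        TweetNode("t4", "Battery life matters more to me"),
    ],
)
candidates = [
    "Pixel camera samples look great in low light",
    "Battery tests are out today",
    "The weather is nice",
]

# Stand-in for the first regression model: fixed fake delays in seconds.
FAKE_DELAYS = {"t1": 120.0, "t2": 45.0, "t3": 300.0, "t4": 90.0}


def toy_delay(node: TweetNode) -> float:
    return FAKE_DELAYS[node.tweet_id]


# Stand-in for the second regression model: word overlap with the anchor.
def toy_retweets(anchor: TweetNode, candidate: str) -> float:
    anchor_words = set(anchor.text.lower().split())
    return float(len(anchor_words & set(candidate.lower().split())))


anchor, reply = recommend_reply(root, candidates, toy_delay, toy_retweets)
print(anchor.tweet_id, "<-", reply)
```

In a real system, the two stand-in functions would be replaced by regressors trained on features of the tweet and its conversation context; the selection logic itself would remain a pair of arg-min and arg-max operations over the tree nodes and the candidate pool.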
Sentiment Analysis is another important task that complements topic identification in order to capture
actionable information from social media. While sentiment analysis has been studied widely, it is not
yet a solved problem for noisy social media data. In this work,
we introduce novel features and learn supervised models to improve the performance of
sentiment analysis on Twitter. With careful feature ablation experiments, we show which of the novel
features contribute to the performance improvement of our system and to what extent. The resulting
model is simple, performs competitively with state-of-the-art systems, and can be quickly and easily
prototyped as an end-to-end system from scratch. One drawback of this work, however, is that we do not
take topics into consideration. Given a piece of text, we assign a sentiment label to it irrespective
of the topics being discussed in that text. This is especially problematic when multiple topics are at
play and the sentiment expressed throughout the text is not uniform.
To alleviate this shortcoming, we next aim to learn topics and sentiment together, although
the resulting model is still a pipeline rather than a joint one. With this model, we aim to overcome the drawbacks of
topic-agnostic sentiment analysis and tackle the problem formally known as Aspect-Based
Sentiment Analysis, in which the goal is to identify the sentiment associated with each of the aspects
being discussed in a post. First, the aspects or topics being discussed in the text are identified; we treat
this sub-task of aspect category detection as a multi-class multi-label classification problem. Instead of
assigning a single generic sentiment label to a piece of text, our goal here is to associate a sentiment
label to each of the aspects or