Abstract
In the last few years, web technologies, and with them social media, have grown immensely. Social
media allow users to share their experiences with others and to influence the people they are connected to. Users can post their views and opinions in the form of blogs, reviews, status updates, tweets, etc. This large volume of data calls for a fast and powerful automatic analytics system in order to gain useful insights into users’ tastes.
Sentiment Analysis is a study of human behavior in which we extract user opinions and emotions
from plain text. Sentiment Analysis is a combination of natural language processing, computational
linguistics and information extraction and is applied to extract subjective information from text.
All textual information can be categorized into two main classes: opinions and facts. Facts are
objective, undeniable statements describing events, people or organizations. News articles are a good example of factual data: they contain exact and correct information
about an event. Opinions, on the other hand, are usually subjective expressions that describe people’s
sentiments, feelings or emotions.
The key task in sentiment analysis is to extract the attitude of the speaker or writer towards a subject.
Sentiment classification can be considered a two-class (subjective vs. objective) classification at the top
level, with the subjective class further divided into two classes (positive and negative). In the literature,
sentiment analysis is primarily treated as a positive vs. negative classification task, with little (almost no)
focus on subjective vs. objective classification.
The Web has proved to be a good source of subjective information (opinionated text) in the form of reviews,
blogs, forum posts, etc., but in some cases we face the problem of information overload.
In this research, we worked on product and movie reviews, blogs and tweets. Each of these genres of
user-generated content exhibits unique qualities that differentiate it from the others. Reviews
range from a few words to a few paragraphs and generally discuss one specific object. Tweets, on the other hand, are limited to 140 characters; because of this limit they are noisy and unstructured, but object-specific. Blogs are long, document-like texts and generally discuss one or more objects simultaneously. Hundreds of websites provide opinionated user-generated content, and it is difficult to go through all of them before making a decision. Thus, automated sentiment detection and summarization is an important task.
From product and movie reviews, we extracted N-gram-based features with and without part-of-speech information and devised a scoring function to find the subjectivity expressed in the text. We also tested the same set of features with various supervised machine learning algorithms such as SVM, NB and MLP, and confirmed that our proposed scoring function performs comparably. Sentiment lexicons also prove to be a useful resource in identifying the polarity of opinion expressed in text.
We use SentiWordNet as our base lexicon to identify the opinion expressed in reviews. Our methods
outperform the previous benchmark accuracies by 3-5%.
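As an illustration only, the N-gram features and a scoring function of the kind described above can be sketched as follows. The abstract does not give the actual formula, so the smoothed log-ratio score used here, the "pos"/"neg" labels, and the names extract_features, train_counts and score are all assumptions made for this sketch, not the method proposed in the thesis.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, max_n=2):
    """Lowercase, split on whitespace, and collect 1..max_n-grams."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats


def train_counts(labelled_reviews, max_n=2):
    """Count how often each n-gram occurs under each label.

    labelled_reviews is an iterable of (text, label) pairs, where
    label is "pos" or "neg" (hypothetical labels for this sketch).
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labelled_reviews:
        counts[label].update(extract_features(text, max_n))
    return counts


def score(text, counts, max_n=2):
    """Sum of add-one-smoothed log ratios of per-label n-gram counts.

    This is an assumed stand-in scoring function: a positive total
    leans towards "pos", a negative total towards "neg".
    """
    total = 0.0
    for feat in extract_features(text, max_n):
        pos = counts["pos"][feat] + 1  # add-one smoothing
        neg = counts["neg"][feat] + 1
        total += math.log(pos / neg)
    return total
```

Such a function needs only n-gram counts from labelled reviews, which is what makes a scoring approach lightweight compared with training a full classifier.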
Tweets are 140-character status updates that are often poorly structured and contain grammatical
errors. It is hard to apply NLP tools or basic rules to extract sentiment information from tweets because of
these anomalies. We present a basic method to pre-process and restructure tweets and to mine the opinions expressed in them. We use two different datasets and perform 5-fold cross-validation of our approach. We also tested our approach with cross-dataset training and testing. The methods proposed in this research show an improvement of 2-4% in classification accuracy.
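A minimal sketch of what tweet pre-processing can look like is given below. The abstract does not list the actual steps, so the normalizations chosen here (URL and mention placeholders, hashtag stripping, squeezing of elongated characters) and the name preprocess_tweet are assumptions for illustration.

```python
import re

# Patterns for common Twitter-specific noise (assumed, not from the thesis).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
ELONGATED_RE = re.compile(r"(.)\1{2,}")  # a character repeated 3+ times


def preprocess_tweet(text):
    """Normalize a raw tweet into a cleaner token stream."""
    text = text.lower()
    text = URL_RE.sub("URL", text)          # replace links with a placeholder
    text = MENTION_RE.sub("USER", text)     # replace @mentions with a placeholder
    text = HASHTAG_RE.sub(r"\1", text)      # keep the hashtag word, drop '#'
    text = ELONGATED_RE.sub(r"\1\1", text)  # "looooove" -> "loove"
    return " ".join(text.split())           # collapse whitespace
```

For example, preprocess_tweet("@bob I looooove this!!! http://t.co/x #happy") returns "USER i loove this!! URL happy".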
For English blog analysis, we propose a novel approach to sentiment mining and summarization based on the objects (entities) in a blog. Sentence-level or document-level sentiment classification is difficult for blogs because they are generally long and have a contextual flow. Using readily available basic
NLP tools, we first identify potential objects and then the opinion associated with each of these
objects.
The major contributions of this thesis are:
• A scoring function for review classification: this function can be used as a replacement for
heavy-duty machine learning algorithms.
• Segregation of features into NLP-based and Twitter-specific features, together with appropriate
pre-processing techniques for tweets.
• A novel approach to entity-centric opinion mining from blogs, along with an entity-specific
opinionated dataset for English blogs.
Keywords: Sentiment Analysis, Opinion Mining, User-generated Content, Blogs, Twitter, Movie
Reviews, Product Reviews