Abstract
The World Wide Web (WWW) is a huge, widely distributed, global source of information for web users. Web documents are broadly classified into unstructured and structured documents. Users prefer structured documents when looking for a piece of information. Hence, in the past decade the research community has focused on mining structured information from unstructured documents and has attempted to preserve it in the form of attribute-value pairs, tables, flow charts, etc. However, the focus has been only on extracting information at the document level or in particular domains such as disaster management, finance, and medicine. These techniques have not attempted to integrate the extracted information into common knowledge repositories such as Wikipedia and DBpedia.
Structured databases like Wikipedia and DBpedia are created through collaborative contributions from volunteers and organizations. Since they rely heavily on manual effort, the process of updating these databases is not only tedious and time-consuming but also fraught with drawbacks. Hence, automatic updating of structured databases has become an active research topic in the past few years. It can be broken down into two sub-problems: Entity Linking and Slot Filling. In this thesis, we address Entity Linking, the task of linking named entities occurring in a document to entries in a Knowledge Base (KB). This is a challenging task because entities occur not only in various forms, viz. acronyms, nicknames, and spelling variations, but also in varied contexts. For example, the mention "ABC" may refer to the American Broadcasting Company or to the Australian Broadcasting Corporation, depending on the document it appears in.
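As a toy illustration of the name-variation problem, a minimal lookup table might map several surface forms of an entity to the same KB entry. All mappings and KB identifiers below are hypothetical examples, not the Entity Repository built in this thesis:

```python
# Toy illustration of name variation in Entity Linking.
# The surface forms and KB identifiers are hypothetical examples.

VARIATIONS = {
    # acronym, nickname, and spelling variants all point to one KB entry
    "ibm": ["KB:International_Business_Machines"],
    "i.b.m.": ["KB:International_Business_Machines"],
    "big blue": ["KB:International_Business_Machines"],
    # an ambiguous acronym points to several possible KB entries
    "abc": ["KB:American_Broadcasting_Company",
            "KB:Australian_Broadcasting_Corporation"],
}

def lookup(mention: str):
    """Return the KB entries known for a surface form, if any."""
    return VARIATIONS.get(mention.strip().lower(), [])

print(lookup("Big Blue"))  # one unambiguous candidate
print(lookup("ABC"))       # two candidates -> disambiguation needed
```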
Once named entities from documents are linked to entries in a knowledge base, information can be integrated across them. Current Information Extraction (IE) techniques can be used to extract information from documents. Person name disambiguation and Co-reference Resolution are two tasks that share many similarities with Entity Linking. These tasks have attempted to link entities across documents, but have not attempted to integrate them into a common Knowledge Base.
Our approach to Entity Linking begins with building an Entity Repository (ER). The ER contains information about the different forms of named entities and is built using Wikipedia's structural information, such as redirect pages, disambiguation pages, and the bold text of the first paragraph. Our core algorithm for Entity Linking can be broken down into two steps: Candidate List Generation (CLG) and Ranking. In the CLG phase, we use the ER, web search results, and a named entity recognizer to identify all possible variations of a given named entity. Using these variations, we obtain an unordered list of candidate nodes from the KB that can be linked to the given named entity in a document. In the ranking phase, we rank this unordered list of candidate nodes using various similarity techniques, calculating the similarity between the text of each candidate node and the document in which the named entity occurs. We experiment with various similarity functions for ranking, such as cosine similarity, Naïve Bayes, maximum entropy, and Tf-idf, as well as re-ranking using pseudo-relevance feedback. Our experiments show that cosine similarity and Naïve Bayes perform close to the state of the art, and the Tf-idf ranking function performs better in some cases.
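To make the two-phase pipeline concrete, here is a minimal sketch of candidate generation followed by Tf-idf cosine-similarity ranking. All data structures, names, and texts below are illustrative assumptions, not the thesis implementation (which additionally uses web search results and a named entity recognizer in the CLG phase):

```python
# Sketch of the two-phase pipeline: (1) Candidate List Generation via an
# entity repository of name variations, (2) ranking candidates by cosine
# similarity between KB-node text and the query document under Tf-idf.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy Entity Repository: surface form -> set of KB node ids (assumed shape).
ENTITY_REPOSITORY = {
    "abc": {"KB:American_Broadcasting_Company",
            "KB:Australian_Broadcasting_Corporation"},
}

# Toy KB node texts keyed by node id.
KB_TEXT = {
    "KB:American_Broadcasting_Company":
        "American commercial television network airing shows nationwide.",
    "KB:Australian_Broadcasting_Corporation":
        "Australian national public broadcaster funded by the government.",
}

def generate_candidates(mention: str) -> list[str]:
    """CLG phase: collect all KB nodes whose known name variations
    match the mention (here, a plain lowercase lookup)."""
    return sorted(ENTITY_REPOSITORY.get(mention.lower(), set()))

def rank_candidates(candidates: list[str], document: str):
    """Ranking phase: score each candidate KB node by the cosine
    similarity of its text with the query document under Tf-idf."""
    if not candidates:
        return []
    corpus = [document] + [KB_TEXT[c] for c in candidates]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return sorted(zip(candidates, scores), key=lambda x: -x[1])

doc = "The network aired the show across the United States on television."
print(rank_candidates(generate_candidates("ABC"), doc))
```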
Our approach was tested on a standard Entity Linking dataset provided as part of the Text Analysis Conference (TAC) Knowledge Base Population (KBP) shared task. We evaluated our approach using the Micro-Average Score (MAS), the standard evaluation metric. We achieved MAS of 83% and 85% on the TAC-KBP Entity Linking 2009 and 2010 datasets respectively, which secured the top spot in these shared tasks.
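For reference, the Micro-Average Score used in the TAC KBP Entity Linking evaluations is, in its standard formulation, micro-averaged accuracy over all queries:

\[
\text{MAS} \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\!\left[\hat{e}_q = e_q\right],
\]

where \(Q\) is the set of queries, \(\hat{e}_q\) is the system's answer for query \(q\) (a KB entry, or NIL when the entity is absent from the KB), and \(e_q\) is the gold-standard answer.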