Abstract
In natural language, two senses of reference are generally described. The first is the symbolic
relationship that expressions (words or phrases) have with concrete or abstract real-world
objects. The second is the relationship between two textual expressions in a text, in
which one expression provides the information needed to interpret or relate to the other. These
expressions are distributed across the text, and their connections form coreference chains (or
links). For a given text, coreference resolution is the task of identifying and linking expressions
that refer to the same object. These coreferring expressions can be anaphors or nominal, verbal, or
verb-nominal expressions. Coreference links (chains) are useful in various NLP applications such as
question answering, text summarization, and machine translation.
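As a minimal illustration of how pairwise links between expressions give rise to coreference chains, the following sketch groups mentions with a simple union-find. It is not the representation used in this thesis: the function name `build_chains`, the use of plain strings as mention identifiers, and the English example are all assumptions made for illustration (real mentions would carry span offsets).

```python
from collections import defaultdict


def build_chains(mentions, links):
    """Group mentions into coreference chains from pairwise links.

    `mentions` is a list of unique mention identifiers in textual order
    (hypothetical; plain strings are used here for brevity).
    `links` is a list of (mention, mention) pairs judged coreferent.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = defaultdict(list)
    for m in mentions:  # iterating in textual order keeps chains ordered
        chains[find(m)].append(m)

    # Singletons (mentions linked to nothing) are not reported as chains.
    return [chain for chain in chains.values() if len(chain) > 1]


mentions = ["Ram", "he", "Sita", "his", "she"]
links = [("Ram", "he"), ("he", "his"), ("Sita", "she")]
print(build_chains(mentions, links))
```

Here the links Ram-he and he-his end up in one chain even though Ram and his were never linked directly, which is the transitive behaviour a chain is meant to capture.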
For English and several other European languages, coreference resolution has been studied
extensively and various techniques have been proposed, but for Indian languages, work has been
very limited. In this thesis, we describe our work on the second sense of reference by proposing
a coreference representation scheme and different coreference relation types between contiguous
mentions of the same coreference chain, such as identity, near-identity, and weak-identity
relations and their sub-types. We also propose and describe methods to resolve coreference in
Hindi news and dialogue text. This thesis makes six major contributions on the topic of
coreference.
First, we identified conceptual, structural, and representational issues specific to Indian
languages in the existing coreference annotation schemes and tried to resolve them by proposing
a unified coreference annotation framework and procedure. This framework covers various
aspects of coreference, such as expression span, coreference chain, and the relation between
contiguous expressions of the same coreference chain.
Second, based on the proposed annotation scheme, we developed a semi-automatic annotation
tool, CAT (Coreference Annotation Tool), to ease the annotation process.
Third, we propose a method for calculating inter-annotator agreement on various aspects and
levels of coreference annotation.
Fourth, using the proposed coreference annotation scheme and CAT, we annotated coreference
information on parts of the Hindi and Urdu Dependency Treebanks.
Fifth, we present a hybrid multi-classifier-based approach to identify the reference type of
an anaphor (pronoun). We describe a hybrid (learning- and rule-based) approach to resolve entity-
and event-referring pronouns in Hindi dialogue and news text. In these approaches, we explore
the use of dependency structures as a source of syntactic information for entity anaphora
resolution, and we compare dependency-structure-based rules (features) with syntax-based
rules (features) for event anaphora resolution. Beyond dependency-based features, we also
explore the use of other linguistic information, such as sub-topic boundaries, animacy, and
named-entity categories, for dialogue anaphora resolution.
Sixth, we present a sieve-based approach for Hindi nominal reference resolution and for
identifying the relation type between contiguous expressions of the same coreference chain. In
this approach, we explore the use of Paninian dependency grammar, various linguistic rules based
on gender, number, person, and animacy, DBpedia-based dictionaries, and word embeddings from
word2vec and GloVe as features and rules to resolve nominal coreference. A hybrid system built on
these rules and features, with sieves applied in a predefined order of preference, achieves
considerable accuracy for nominal reference resolution and relation type identification.
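To make the idea of sieves applied in a predefined order more concrete, the following is a minimal sketch of a generic multi-pass sieve for nominal mentions. It is not the system described in this thesis: the `Mention` fields, the two sieves (`exact_match`, `head_match`), their ordering, and the English example are hypothetical and stand in for the richer Hindi-specific rules, dictionaries, and embedding features mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    idx: int      # position of the mention in textual order
    text: str     # full mention string
    head: str     # head word of the mention
    gender: str   # "m", "f", or "?" when unknown
    number: str   # "sg", "pl", or "?" when unknown


def agree(a, b):
    """Gender and number agree; an unknown value ("?") is compatible with anything."""
    def compatible(x, y):
        return x == "?" or y == "?" or x == y
    return compatible(a.gender, b.gender) and compatible(a.number, b.number)


def exact_match(m, ante):
    # Hypothetical high-precision sieve: identical strings, ignoring case.
    return m.text.lower() == ante.text.lower()


def head_match(m, ante):
    # Hypothetical lower-precision sieve: same head word plus agreement.
    return m.head.lower() == ante.head.lower() and agree(m, ante)


# Sieves run in a fixed order, most precise first (an assumed ordering).
SIEVES = [exact_match, head_match]


def resolve(mentions, sieves=SIEVES):
    """Return a mapping from each mention idx to the idx of its chain's first mention."""
    antecedent = {}
    for sieve in sieves:
        for i, m in enumerate(mentions):
            if m.idx in antecedent:
                continue  # already linked by an earlier sieve
            for ante in reversed(mentions[:i]):  # nearest candidate first
                if sieve(m, ante):
                    antecedent[m.idx] = ante.idx
                    break

    def root(i):
        # Antecedents always precede the mention, so this loop terminates.
        while i in antecedent:
            i = antecedent[i]
        return i

    return {m.idx: root(m.idx) for m in mentions}


mentions = [
    Mention(0, "the Prime Minister", "minister", "m", "sg"),
    Mention(1, "the minister", "minister", "?", "sg"),
    Mention(2, "the ministers", "ministers", "?", "pl"),
    Mention(3, "The Prime Minister", "minister", "m", "sg"),
]
print(resolve(mentions))
```

In this example the exact-match sieve links mention 3 to mention 0, the head-match sieve then links mention 1 to mention 0, and mention 2 stays a singleton. Running precise sieves first means a looser rule can only decide links that the stricter rules left open.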
In conclusion, we combine all the above-mentioned modules (entity anaphora resolution,
event anaphora resolution, and nominal reference resolution) into a single software kit so that
the NLP community can use it and give us feedback on the presented approaches.