Abstract
Natural Language Processing (NLP) is a challenging field at the intersection of artificial intelligence and
computational linguistics. In simple terms, it can be defined as the processing of a natural language in any
form, whether speech or written text. Although extensive research has been carried out for many of the
world's languages, Indian languages still lag behind.
Processing any natural language requires analysis at multiple levels: the word, phrase, sentence, and
semantic levels, and the higher levels of pragmatics and discourse. In this work we present our efforts
toward new advancements at the sentence level, which in linguistic terms is known as syntactic parsing.
Syntactic parsing establishes relations between the words of a sentence in order to convey its possible
meaning.
Indian languages are morphologically rich and exhibit free word order (MoR-FWO). Dependency
parsing, a type of syntactic parsing, is better suited to such languages. Our efforts begin with delivering
a state-of-the-art Hindi dependency parsing system through the platform provided by a shared task
(Sharma et al., 2012). We employed a data-driven, transition-based statistical system (MaltParser),
trained on the Hindi Dependency Treebank (Bhatt et al., 2009; Palmer et al., 2009). The error analysis
performed in this task helped us target the remaining problems more specifically.
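For illustration, a dependency parse assigns every word of a sentence a head word and a relation label on the arc connecting them. The sketch below is a minimal, assumed representation using a transliterated Hindi sentence and Paninian-style labels in the spirit of the Hindi Dependency Treebank; the labels and indices are illustrative, not output of the actual system.

```python
# Minimal illustrative representation of a dependency parse:
# each token is attached to a head token via a labelled arc.
# Sentence (transliterated Hindi): "raam ne seb khaayaa" ("Ram ate an apple").
# Labels k1 (agent) and k2 (object) follow the Paninian-style scheme used in
# the Hindi Dependency Treebank; the case-marker label here is illustrative.

parse = [
    # (index, word, head, label) -- indices are 1-based, head 0 = root
    (1, "raam",    4, "k1"),    # 'raam' is the agent of 'khaayaa'
    (2, "ne",      1, "psp"),   # ergative case marker attached to 'raam'
    (3, "seb",     4, "k2"),    # 'seb' is the object of 'khaayaa'
    (4, "khaayaa", 0, "root"),  # the main verb heads the sentence
]

for idx, word, head, label in parse:
    head_word = "ROOT" if head == 0 else parse[head - 1][1]
    print(f"{word} --{label}--> {head_word}")
```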
In the next phase, to target problems such as case ambiguity, data sparsity, and missing case markers,
we aided dependency parsing by enriching the training model with semantic information extracted
automatically from a rich lexical resource, Hindi WordNet (Narayan et al., 2002). Building on the
insights gained in this process, we moved to another well-established approach, ensembling. Ensembling
exploits the diversity of multiple parsing systems and combines their strengths to improve parsing
performance. We explored two ensembling approaches, namely re-parsing algorithms and word-by-word
voting, using six different weighting strategies to combine six algorithmic variants of MaltParser.
Improvements were observed with the second approach.
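As an illustration of the voting idea, the sketch below shows word-by-word voting over head predictions; the parser outputs, weights, and function names are illustrative assumptions and do not correspond to the actual MaltParser variants or weighting strategies used here.

```python
from collections import defaultdict

def vote_heads(predictions, weights):
    """Word-by-word voting over head predictions.

    predictions: one head sequence per parser
                 (predictions[p][i] = head chosen by parser p for word i)
    weights:     one weight per parser (e.g. its accuracy on a tuning set)
    Returns the combined head sequence.
    """
    n_words = len(predictions[0])
    combined = []
    for i in range(n_words):
        scores = defaultdict(float)
        for parser_heads, w in zip(predictions, weights):
            scores[parser_heads[i]] += w   # each parser casts a weighted vote
        combined.append(max(scores, key=scores.get))
    return combined

# Illustrative example: three parsers disagree on the head of the third word.
predictions = [
    [2, 0, 2],   # parser A
    [2, 0, 1],   # parser B
    [2, 0, 2],   # parser C
]
weights = [0.86, 0.84, 0.82]              # e.g. per-parser accuracies
print(vote_heads(predictions, weights))   # -> [2, 0, 2]
```

Such per-word voting does not by itself guarantee a well-formed tree, which is what motivates the re-parsing alternative that combines the weighted votes under a tree constraint.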
A systematic comparison of the two ensembling techniques gave us the lead to search for a better
weighting strategy to improve ensembling. The search ended with the Parse Quality Estimation (PQE)
score. Adapting work done in the past, we extended this functionality for our purpose of ensembling,
and the approach that had failed earlier now showed improvements. Further, we expanded the scope of
the PQE score to dependency arcs (attachments) in order to capture the confusion made by the oracle
in the parsing system, and extended it further to the joint prediction of arcs and labels. To demonstrate
the efficacy of the approach, we implemented several real-world applications of the PQE score. Finally,
we proposed a robust evaluation framework in terms of Domain Adaptability (DA) and Inter-Language
Portability (ILP) to better judge the effectiveness of the Hindi dependency parser. During this evaluation,
using the property of portability, we also built dependency parsers for two Dravidian languages, Tamil
and Telugu, which can be integrated easily into real-world NLP systems such as Machine Translation.
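To illustrate how a parse-quality estimate can serve as a vote weight, the sketch below weights each candidate parse's vote by a score from a `pqe_score` callable; this callable is a hypothetical stand-in for a trained PQE model, and the example scorer and data are purely illustrative.

```python
from collections import defaultdict

def pqe_weighted_vote(parses, pqe_score):
    """Word-by-word voting where each candidate parse's vote is weighted
    by an estimated parse quality rather than a fixed per-parser weight.

    parses:    one head sequence per parser, all for the same sentence
    pqe_score: callable returning an estimated quality for a full parse
               (a hypothetical stand-in for a trained PQE model)
    """
    weights = [pqe_score(p) for p in parses]
    combined = []
    for i in range(len(parses[0])):
        scores = defaultdict(float)
        for heads, w in zip(parses, weights):
            scores[heads[i]] += w        # quality-weighted vote for this word's head
        combined.append(max(scores, key=scores.get))
    return combined

# Illustrative usage with a dummy scorer that prefers parses with shorter arcs.
def dummy_pqe(heads):
    return 1.0 / (1.0 + sum(abs(h - (i + 1)) for i, h in enumerate(heads)))

parses = [[2, 0, 2], [2, 0, 1], [3, 0, 2]]
print(pqe_weighted_vote(parses, dummy_pqe))   # -> [2, 0, 2]
```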