Abstract
Morphological analysis is a fundamental task in most Natural Language Processing (NLP) applications,
especially for morphologically rich languages such as Hindi. Morphological analysis for Hindi
involves predicting lemma, POS, gender, number, person, case-marker, TAM and vibhakti. In general,
morphological analyzers predict all possible analyses for a given word. For most NLP tasks, instead
of keeping multiple analyses for a word, we need to disambiguate them
and arrive at the one analysis that best fits the given sentential context.
The prime motivation for carrying out the research in this thesis comes from the initial set of Hindi
parsing experiments that we carried out. The existing morphological analyzers for Hindi do not predict
context-based morphological analysis. Because automatic context-based morphological information was
lacking, parsing accuracy could not be improved. In this thesis, we try to predict a single analysis for
a word in a given context. This thesis deals with predicting context-based morph information for the
attributes viz. lemma, gender, number, person, case-marker, TAM and vibhakti. The existing analyzers
also perform poorly for Out-Of-Vocabulary (OOV) words. We also aim to address this issue and make
an exhaustive evaluation of our predictions for OOV words.
For lemma prediction, we adopt a machine translation approach. We treat gender, number, person and
case-marker prediction as a classification problem. TAM and vibhakti are better predicted
by a rule-based approach. For lemma, gender, number, person and case prediction, we achieved an overall
accuracy of 84.25% and an OOV accuracy of 63.06%.
To show that the predicted morphological information helps NLP applications, we
carry out parsing experiments without and with the predicted morph information and report Labeled
Attachment Scores of 87.75% and 89.41% respectively.
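For readers unfamiliar with the metric, the following is a minimal sketch of how a Labeled Attachment Score is conventionally computed: the percentage of tokens whose predicted head and dependency label both match the gold annotation. The function name, the token representation and the toy data are assumptions made for illustration; they are not taken from the parsing setup used in this thesis.

```python
from typing import List, Tuple

# Each token is represented as (head_index, dependency_label).
# This representation and the labels below are hypothetical.
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Return the percentage of tokens with both head and label correct."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        return 0.0
    # A token counts as correct only if head AND label both match.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Invented four-token sentence: the third token has the right head
# but the wrong label, so it is counted as an error.
gold = [(2, "subj"), (0, "root"), (2, "obj"), (3, "case")]
pred = [(2, "subj"), (0, "root"), (2, "subj"), (3, "case")]
score = labeled_attachment_score(gold, pred)
print(f"LAS = {score:.2f}%")
```

In this toy case three of the four tokens are fully correct, so the score is 75%.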
Building machine translation models is time-consuming. Hence, in the later part of the thesis we
also treat lemma prediction as a classification problem. We also experiment with other features
for gender, number, person and case prediction. We report overall and OOV accuracies of 85.87% and
65.96% respectively. We have seen a 1-2% improvement over our earlier set of experiment results.
We extend our approach to other Indian languages, viz. Urdu and Telugu, for predicting context-based
morphological information.