Abstract
Parsing morphologically rich, free word order languages such as Hindi, Czech, and Turkish is a challenging task. Unlike English, most parsers for such languages have adopted the dependency grammatical framework, which is better suited to the task (Shieber, 1985; Mel'čuk, 1988; Bharati et al., 1995). Owing to the availability of manually annotated corpora in recent years, data-driven dependency parsing has achieved considerable success. In this work, we explore the performance of three data-driven dependency parsers, Malt, MST, and a Partial parser, for Hindi dependency parsing. Our system uses a Hindi corpus that has manually annotated dependency relations along with automatically extracted features such as POS tags, chunks, and morphological information, and performs complete word-level parsing. We achieved an accuracy of 86.5% Unlabeled Attachment Score (UAS) and 77.9% Labeled Attachment Score (LAS), which is the state of the art for automatic Hindi dependency parsing.
We then tried to improve parsing accuracy using two bootstrapping techniques: self-training and co-training. Self-training is a simple process of incorporating unlabeled data into the training data of the parser: the parser's own output on the unlabeled data is used to improve its performance. Co-training is a technique in which two or more parsers of comparable accuracy parse unlabeled data and provide additional training data for each other.
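To make the simpler of these two procedures concrete, the following is a minimal self-training sketch. The helpers `train`, `parse`, and `confidence` are hypothetical placeholders for the corresponding operations of a data-driven parser such as Malt or MST; this illustrates the general technique, not the exact pipeline used in this work.

```python
def self_train(labeled, unlabeled, train, parse, confidence,
               iterations=3, top_k=1000):
    """Iteratively grow the training set with the parser's own
    high-confidence output on unlabeled sentences."""
    training_data = list(labeled)
    for _ in range(iterations):
        model = train(training_data)                     # retrain on current data
        parsed = [(sent, parse(model, sent)) for sent in unlabeled]
        # Keep only the parses the model is most confident about.
        parsed.sort(key=lambda p: confidence(model, p[1]), reverse=True)
        training_data.extend(tree for _, tree in parsed[:top_k])
    return train(training_data)
```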
Taking the automatic Hindi dependency parser as our baseline, we applied self-training using a large raw corpus. We also performed co-training between the state-of-the-art parser (Malt), another parser of comparable accuracy (MST), and a parser that can learn from partial structures (the Partial parser). We experimented with various criteria for selecting the top sentences for bootstrapping, such as the average classifier score of a sentence and the product of the classifier scores. We also extracted highly confident sub-parses using different techniques, which were then used to improve the performance of the Partial parser.
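As an illustration of these selection criteria, here is a minimal sketch, assuming each parsed sentence carries a list of per-decision classifier scores in (0, 1]; the exact formulation used in the experiments may differ.

```python
import math

def avg_score(scores):
    """Average classifier score over a sentence's parsing decisions."""
    return sum(scores) / len(scores)

def prod_score(scores):
    """Product of classifier scores, computed in log space for stability.
    Assumes every score is strictly positive."""
    return math.exp(sum(math.log(s) for s in scores))

def select_top(parsed, k, criterion=avg_score):
    """parsed: list of (tree, scores) pairs; keep the k most confident."""
    return sorted(parsed, key=lambda p: criterion(p[1]), reverse=True)[:k]
```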
Both self-training and co-training were performed using two different types of raw corpora: one from the same domain as the training and test data, and another from a different domain. We present a comparative analysis of the effect of the domain of the raw corpus on the bootstrapping techniques. We also compare the effect on parser accuracy of using automatically parsed data obtained via bootstrapping versus additional gold data. Through bootstrapping, we achieved a best accuracy of 87.1% Unlabeled Attachment Score (UAS) and 78.8% Labeled Attachment Score (LAS), a significant improvement over the state-of-the-art accuracy for automatic parsing. Using highly confident sub-parses, we were able to improve the performance of the Partial parser from 79.85% UAS to 84.5% UAS, which is comparable to the state-of-the-art accuracy for automatic parsing.