Abstract
Indian languages are morphologically rich and relatively free in word order (MoR-FWO), and hence differ from English in both structure and morphology. They possess many distinctive characteristics, and while working with these languages we must keep these characteristics in mind and plan our strategies accordingly. We worked on improving dependency parsing for Indian languages, more specifically for Hindi, an Indo-Aryan language.
In conventional dependency parsing, the focus has been on developing robust data-driven parsing techniques. This initiated efforts to create large hand-annotated treebanks, which serve as training input for data-driven parsers. The annotations in Indian language treebanks are generally multi-layered and furnish information on the part-of-speech category of word forms, their morphological features, related word groups, and syntactic relations. For further improvements, ever richer features are being added. This process of manual annotation is expensive, as it requires a great deal of human effort, and creating treebanks for all languages is a tedious task. Even where treebanks are available, a real-world scenario requires many tools to extract such features automatically, and building these tools is itself a complex task. We are in an era with almost unlimited access to raw data; nevertheless, we often struggle to make sense of most of it. Much of this data is unlabeled and thus unusable in traditional supervised machine-learning scenarios, which require explicitly labeled, hand-annotated examples. In this work, we present our efforts towards exploring cost-effective approaches to building and improving parsers for resource-poor languages. For this purpose, we use unsupervised techniques to extract features from largely available monolingual raw corpora.

Using cross-lingual treebank transfer, we exploit the treebanks available for other languages
and, using techniques such as machine translation (MT), generate a treebank for the target language, which can then be used to train a parser. We first try this approach for Hindi. An important constraint of this approach is that the treebank annotation needs to be cross-linguistically consistent; for this, we use the Universal Dependencies (UD) framework. Universal Dependencies is an initiative to create cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language-typology perspective. Previous studies have shown that the CPG framework is better suited for Indian languages, so we compare other techniques with cross-lingual parsing in the UD framework.

In the concluding work, we make use of vector space modeling on large monolingual raw data, a recent technique used widely across different tasks. Use of a large monolingual
corpus helps reduce the problem of data sparsity. We explore this technique to achieve three goals. First, using word embeddings extracted from vector space modeling as features, in addition to the conventional features, we try to improve the state-of-the-art accuracy for Hindi. The second goal is to help build parsers for less-resourced languages; this is done by replacing the costly linguistic features with word embeddings, which requires minimal human annotation. The third goal is improving the parser's performance on general-domain data: we show results where the parser is trained on the News domain and the input sentences come from four different domains, namely Box-Office, Cricket, Gadget, and Recipe.
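As a minimal sketch of the word-embedding feature idea above: the tag set, embedding values, and feature template here are illustrative assumptions, not the actual model, tagset, or features used in this work.

```python
# Sketch: augmenting conventional categorical parser features with
# word embeddings. All values below are toy/hypothetical.

POS_TAGS = ["NN", "VM", "PSP", "JJ"]  # illustrative sample of a Hindi tagset

# Hypothetical pre-trained embeddings; in practice these would be
# induced from a large monolingual raw corpus via vector space modeling.
EMBEDDINGS = {
    "raam": [0.21, -0.53, 0.08],
    "khaata": [-0.10, 0.44, 0.31],
}

def one_hot(tag):
    """Conventional categorical feature: one-hot POS indicator."""
    return [1.0 if tag == t else 0.0 for t in POS_TAGS]

def feature_vector(word, tag, dim=3):
    """Concatenate the conventional POS feature with the word's
    embedding; unknown words fall back to a zero vector, which is one
    way such features stay cheap for less-resourced settings."""
    emb = EMBEDDINGS.get(word, [0.0] * dim)
    return one_hot(tag) + emb

vec = feature_vector("raam", "NN")
# 4 POS dimensions + 3 embedding dimensions = 7 features in total
```

Here the embedding dimensions simply extend the conventional feature vector, so an existing data-driven parser can consume them without architectural changes.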
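The cross-lingual treebank transfer described earlier can likewise be sketched as annotation projection over a word alignment. The sentence, heads, and alignment below are toy assumptions; a real pipeline would obtain them from an existing UD treebank and an MT or word-alignment system.

```python
# Sketch of cross-lingual treebank transfer via annotation projection:
# dependency heads annotated on a source-language sentence are carried
# over to the target-language sentence through a word alignment.

def project_heads(src_heads, alignment):
    """Project dependency heads from source to target tokens.

    src_heads: src_heads[i] is the 0-based head of source token i
               (-1 marks the root).
    alignment: dict mapping source token index -> target token index.
    Returns a dict from target index to projected target head index
    (-1 for root); relations whose tokens are unaligned are dropped.
    """
    tgt_heads = {}
    for dep, head in enumerate(src_heads):
        if dep not in alignment:
            continue  # dependent has no target counterpart
        if head == -1:
            tgt_heads[alignment[dep]] = -1  # root stays root
        elif head in alignment:
            tgt_heads[alignment[dep]] = alignment[head]
    return tgt_heads

# Toy English source "Ram eats": "eats" is root, "Ram" depends on it.
src_heads = [1, -1]
# Toy alignment to Hindi target "raam khaata hai": Ram->0, eats->1.
alignment = {0: 0, 1: 1}
projected = project_heads(src_heads, alignment)
# projected == {0: 1, 1: -1}
```

Dropping unaligned relations is one simple policy; the resulting partial target trees are what make the cross-linguistically consistent UD annotation a prerequisite for this kind of transfer.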