Abstract
The present work attempts to build a system for the automatic translation of nominal compounds (NCs)
from English to Hindi. A noun compound is a sequence of nouns acting as a single noun, e.g., colon cancer,
suppressor protein, colon cancer tumor suppressor protein. Noun compounds comprise 3.9% and 2.6% of all
tokens in the Reuters corpus and the British National Corpus (BNC), respectively. As of today, no good
system exists for the translation of multi-word expressions from English to any Indian language. We have
evaluated two state-of-the-art systems, Moses and the Google translation system, to check the accuracy of
noun compound translation from English to Hindi. On a test set of 300 noun compounds, the Google
translation system achieves an accuracy of 57%, while Moses, a statistical machine translation system,
achieves 48%. These figures indicate that automatic NC translation from English to Hindi is an important
subtask of machine translation that existing systems do not yet handle well. We build a Noun Compound
Translation system (NCT) which achieves an accuracy of 64% on the same test set.
This thesis examines two approaches for the translation of noun compounds from English to Hindi. We
have manually studied 50K English-Hindi parallel sentences and found that noun compounds in English
are translated into noun compounds in Hindi in over 40% of the cases. In the other cases they are
translated into varied syntactic constructs. Among them, the most frequent construction type is
“Modifier + Post-Position + Head”, which occurs in 35% of all the cases. Some examples are
“cow milk” → “gAya kA dUXa” and “wax work” → “mOMa para ciwroM”. This observation motivates both
translation approaches in the present thesis: a) translation of NCs by paraphrasing on the source side
and mapping the paraphrase to a target construct, and b) context-based translation by searching and
ranking translation candidates on the target side.
In the first approach, English nominal compounds are automatically paraphrased and the paraphrases are
translated into Hindi constructions. The paraphrasing is done with prepositions, following the approach
of [Lauer 1995] to paraphrasing nominal compounds. For example, “cow milk” is paraphrased as ‘milk from
cow’ and “blood sugar” as ‘sugar in blood’. Since English prepositions have a one-to-one
mapping to post-positions in Hindi, English paraphrases are easily translated into Hindi using the mapping
schema. Assuming that the lexical substitution for the component nouns of the compound is correct, this
method examines how paraphrasing English nominal compounds aids translation.
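The paraphrase-and-map step described above can be sketched as follows. This is a minimal illustration
and not the thesis's actual mapping schema: the function name, the toy lexicon, and the
preposition-to-post-position table are all assumptions made for the example, and the table entries are
illustrative guesses. The example uses the paraphrase ‘milk of cow’ so that the toy table reproduces the
attested construction “gAya kA dUXa”.

```python
# Hypothetical preposition -> post-position table (WX notation).
# Illustrative assumption only; the thesis defines its own mapping schema.
PREP_TO_POSTPOSITION = {"of": "kA", "in": "meM", "on": "para"}

# Toy bilingual lexicon; lexical substitution is assumed to be correct.
LEXICON = {"cow": "gAya", "milk": "dUXa"}


def translate_paraphrase(paraphrase, prep_map=PREP_TO_POSTPOSITION, lexicon=LEXICON):
    """Map an English paraphrase 'head prep modifier' to the Hindi
    construction 'modifier post-position head'."""
    head, prep, modifier = paraphrase.split()
    postposition = prep_map[prep]  # one-to-one lookup, as the text assumes
    return f"{lexicon[modifier]} {postposition} {lexicon[head]}"


print(translate_paraphrase("milk of cow"))
```

The word order is reversed during mapping because the English paraphrase places the head first, while
the Hindi construction “Modifier + Post-Position + Head” places it last.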
In the second approach, we first generate translation templates for the target language. These templates
are all the possible Hindi construction types that English nominal compounds can be translated into.
A context-based translation system takes context into consideration while translating. We translate a
noun compound by taking the sentence in which the compound occurs as the context. For example, the
expression “finance minister” is the nominal compound to be translated in the sentence “The finance
minister declared the financial budget for this year”. Other content words in the sentence, such as
‘declared’, ‘financial’, ‘budget’ and ‘year’, form the context. We apply a word sense disambiguation
(WSD) tool to select the correct sense of the component nouns of the NC in the given context. We use a
bilingual dictionary to get the Hindi translation of the component nouns in the sense selected by the
WSD tool. Thus context-based lexical substitution is accomplished for the target language. The output of
lexical substitution is placed in the translation templates, and the resulting constructions are searched
in an indexed Hindi corpus of 28 million words. For ranking, the frequency of occurrence of each complete
translation candidate in the target language (TL) corpus is taken as the baseline. To improve on the
baseline, a stronger ranking measure is borrowed from [Tanaka & Baldwin 2003b].
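The template-filling and baseline ranking steps described above might look like the sketch below. It
covers only the frequency baseline, not the measure of [Tanaka & Baldwin 2003b]. The template inventory,
the function names, and the three-line toy corpus (with filler tokens w1 to w4) are assumptions for
illustration; a real system would query the indexed 28-million-word corpus instead.

```python
# Hypothetical template inventory: NC -> NC, and Modifier + Post-Position + Head.
TEMPLATES = ["{mod} {head}", "{mod} {pp} {head}"]
POSTPOSITIONS = ["kA", "para"]  # toy list; a real inventory would be larger

# Invented toy corpus standing in for the indexed Hindi corpus.
TOY_CORPUS = ["w1 gAya kA dUXa w2", "gAya kA dUXa w3", "w4 gAya dUXa"]


def generate_candidates(mod, head):
    """Fill every template with the lexically substituted nouns."""
    candidates = []
    for template in TEMPLATES:
        if "{pp}" in template:
            for pp in POSTPOSITIONS:
                candidates.append(template.format(mod=mod, pp=pp, head=head))
        else:
            candidates.append(template.format(mod=mod, head=head))
    return candidates


def count_phrase(phrase_tokens, sentence_tokens):
    """Count contiguous occurrences of the full candidate in one sentence."""
    n = len(phrase_tokens)
    return sum(
        1
        for i in range(len(sentence_tokens) - n + 1)
        if sentence_tokens[i:i + n] == phrase_tokens
    )


def rank_by_frequency(candidates, corpus):
    """Baseline ranking: sort candidates by corpus frequency, highest first."""
    tokenized = [sentence.split() for sentence in corpus]
    scored = [
        (cand, sum(count_phrase(cand.split(), toks) for toks in tokenized))
        for cand in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_frequency(generate_candidates("gAya", "dUXa"), TOY_CORPUS)
for candidate, frequency in ranked:
    print(candidate, frequency)
```

Matching on whole tokens rather than raw substrings keeps a candidate from being counted inside a longer
word, and candidates that never occur in the corpus simply receive a frequency of zero.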
The context-based translation approach is adopted in the present work for building the noun
compound translation system (NCT), which is integrated with Moses. The outputs of Moses and of Moses
integrated with NCT are compared. Evaluation of the system is carried out at two levels: with the
automatic evaluation metric BLEU and by manual evaluation. The issues with automatic evaluation are
discussed in detail, and they motivate manual evaluation under the given circumstances.