Abstract
A noun compound (NC) is a sequence of nouns acting as a single noun, for example, ‘colon cancer’, ‘suppressor protein’, ‘colon cancer tumor suppressor protein’. While the two NCs ‘colon cancer’ and ‘suppressor protein’ combine to form the more complex NC <<<colon cancer> tumor> <suppressor protein>>, they themselves encode very different information that is not explicit at the surface level. ‘Colon cancer’, the cancer of the large intestine, encodes Spatial information, whereas ‘suppressor protein’ (usually of a tumor), the protein that suppresses cellular proliferation by blocking unscheduled cell division, indicates the Purpose of the protein. This makes the task of interpreting NCs quite significant for understanding natural language text. Their frequent and productive nature in language makes their interpretation all the more essential and interesting.
They comprise 3.9% and 2.6% of all tokens in the Reuters corpus and the British National
Corpus (BNC), respectively [1]. Research on the interpretation of NCs has followed two directions: (i) identifying the semantic relation between the modifier and the head of the NC [49, 3, 39] and (ii) paraphrasing the NC [7, 31]. The NC ‘colon cancer’ in the above example encodes a Spatial relation and is paraphrased as ‘cancer in colon’, while the NC ‘suppressor protein’ has a Causal-Purpose relation and is paraphrased as ‘protein used for suppressing (tumor)’. In our work, we focus on identifying the semantic relation between the modifier and the head of binary noun compounds only, by extracting their paraphrases from a large corpus. The ability to identify the semantic relations of NCs is useful in many NLP tasks, including Question Answering (QA), Paraphrasing, Machine Translation (MT), and Knowledge Base (KB) acquisition.
The task of automatically interpreting the semantic relation of an NC has been a major area of interest for more than a decade now. The problem has been attempted as a supervised-learning problem with both ontology-based and statistical approaches. We work with both of these approaches and gain useful insights for developing a complementary Hybrid model integrating the two. The ontology-based model uses a knowledge-intensive ontology (i.e., WordNet) as its source of knowledge. We implement and experiment with two existing ontology models, SemScat 1 by [30] and SemScat 2 by [3], which use WordNet’s rich noun hypernymy hierarchy to form the classification features. The SemScat 2 model stores knowledge at multiple levels, ranging from the most general to the most specific, and outperforms the SemScat 1 model, which stores only a single generalized level of knowledge.
Further investigation reveals that the ontology model performs quite well when a test instance matches specific knowledge in the model, but assigns the correct relation with low accuracy when
the general level of knowledge is matched. The ontology model has a few limitations. Firstly, it fails to disambiguate between the NCs of certain relations; secondly, it requires the WordNet senses of the head and the modifier of the NC to be annotated in order to extract their corresponding hypernymy hierarchies, and since Word Sense Disambiguation (WSD) is a challenging semantic task in itself, this is a major constraint on the usage of the model. The statistical model, on the other hand, extracts paraphrases of the NC from a large corpus and uses them to identify its semantic relation [43, 57]. With the increasing availability of web data, there has been growing interest in the statistical approach.
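To make the role of the hypernymy hierarchy concrete, the sketch below shows how such features could be read off WordNet via NLTK. The function name, the flat set-of-hypernyms feature format, and the fixed sense index are illustrative assumptions only; they are not the SemScat implementation.

# Illustrative sketch: WordNet noun hypernymy features for the head and
# modifier of an NC, roughly in the spirit of the SemScat models described
# above. Assumes NLTK with the WordNet data installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def hypernym_features(noun, sense_index=0):
    """Return the set of hypernym synset names for one noun sense.

    The ontology model assumes the WordNet sense is already annotated;
    here we simply pick `sense_index`, which illustrates the WSD
    limitation noted above.
    """
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return set()
    sense = synsets[min(sense_index, len(synsets) - 1)]
    # Collect every synset on every hypernym path up to the root ('entity').
    return {s.name() for path in sense.hypernym_paths() for s in path}

# Example: features for the NC 'colon cancer' (modifier = 'colon', head = 'cancer').
modifier_feats = hypernym_features('colon')
head_feats = hypernym_features('cancer')

The classifier can then be trained on these (modifier, head) feature sets at whatever level of generality the hierarchy provides, which is where the single-level versus multi-level distinction between SemScat 1 and SemScat 2 comes in.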
We propose a novel corpus-based model which uses prepositional paraphrases (‘protest by students’), verbal paraphrases (‘protest involving students’), and verb+preposition paraphrases (‘protest started by students’) extracted from a large corpus (i.e., the Google N-Gram corpus), and represents each NC with a prepositional and a verbal vector. With the relation of each NC in hand (supervised learning), the NC vectors are transformed into Relation Vectors. Each relation vector is a single pair of prepositional and verbal vectors which captures the behavior of the entire class and thus plays a significant role in our VSM models. We work extensively on investigating the relative relevance of each paraphrase in identifying a relation, and use a modified TF-IDF weighting function to assign higher weights to the paraphrases that are more relevant. The performance of the corpus model is compared to that of the ontology model on two datasets: the [41] dataset of 600 NCs, where the modifier can be an adjective, adverb, or noun, and the [20] gold-paraphrased dataset of 395 noun-noun compounds. The results of the corpus-based model are quite promising.
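To make the vector-space idea concrete, here is a minimal sketch of building relation vectors from paraphrase counts and classifying a new NC by cosine similarity. The toy counts, the plain TF-IDF-style reweighting, and all names are illustrative assumptions, not the modified weighting function or data used in the thesis.

# Minimal sketch of the VSM idea: NC vectors over paraphrase features
# (prepositions/verbs) are summed per relation into relation vectors,
# reweighted with a generic TF-IDF-style scheme (a stand-in for the
# thesis's modified weighting), and a new NC is assigned the relation
# whose vector it is most similar to.
import math
from collections import Counter, defaultdict

def relation_vectors(train):
    """train: list of (paraphrase_counts: dict, relation: str) pairs."""
    rel_vecs = defaultdict(Counter)
    for counts, rel in train:
        rel_vecs[rel].update(counts)            # sum NC vectors of each class
    # IDF over relations: down-weight paraphrases shared by many classes.
    n_rel = len(rel_vecs)
    df = Counter(p for vec in rel_vecs.values() for p in vec)
    idf = {p: math.log(n_rel / df[p]) + 1.0 for p in df}
    return {rel: {p: c * idf[p] for p, c in vec.items()}
            for rel, vec in rel_vecs.items()}

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(counts, rel_vecs):
    return max(rel_vecs, key=lambda rel: cosine(counts, rel_vecs[rel]))

# Toy usage with made-up paraphrase counts for two NCs.
train = [({'in': 12, 'of': 3}, 'Spatial'),
         ({'for': 9, 'used_for': 4}, 'Purpose')]
rv = relation_vectors(train)
print(classify({'in': 5, 'of': 1}, rv))         # expected: 'Spatial'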
Ultimately, we propose a Hybrid model, which uses the knowledge of both the knowledge-intensive
ontology model