Abstract
Various philosophers, including Bertrand Russell and Ludwig Wittgenstein, have shared the view
that extra-linguistic reality and its nature are somehow linked with language. Given that language is
a medium for capturing the objective truth around us, it is possible that the formal logic of language
could also be inspired by how the world around us operates. Reality, as we understand it, is a series of
continuous interactions between objects. These interactions are loosely referred to as events.
With the constant rise in content generated every day, it has become an arduous task to keep track of all the
events happening in the real world. A major challenge that information access tasks face is that they have
to deal with colossal amounts of unstructured data, i.e., data whose format only loosely implies its meaning,
described in a human-understandable language. This makes understanding event mentions in text a key
problem in Natural Language Processing. Event detection is an important Natural Language Processing
and Information Retrieval problem that has been extended to multiple domains and is now being tackled
for languages other than English.
In this thesis, we address the problem of event detection in Hindi. Hindi, like many other languages,
is an object-centric language that grounds a metaphysical system in objects, with sentence constructions
mainly describing the properties (state) of a noun or the interaction between these nouns. If we were
to use Hindi to define the nature of being, real world events would be formally captured as a series of
interactions between nouns. We are further motivated by the fact that, in the past few years, the doctrine
that actions are events has become an essential, and sometimes unargued, part of the received view
in the philosophy of action, despite the efforts of a few philosophers to undermine the consensus. This
task is an important one given the constant rise in volume of Hindi data being produced and consumed.
A recent report shows that Hindi content consumption on the web is growing at a rate of 94% per year
as compared to English’s 19%.
For a low-resource language like Hindi, with an abundance of unstructured text, we believe that systems which automatically detect events in unstructured text can enable significant advances in the
performance of various information access tasks. The task of event detection in Hindi is relatively unexplored, partly because of syntactic features such as relatively free word order and the nature of verb adjuncts,
and partly due to the lack of structured data in Hindi. These properties of Hindi increase the number of
options available for expressing structured information, hence significantly complicating how chunks of
text can be semantically interpreted.
From the perspective of extracting latent information, defining events as semantico-syntactic objects allows the extraction of relations between events and entities that are not explicitly mentioned.
The extracted event information can then be readily consumed by downstream tasks such as Question
Answering, Summarization and Knowledge Graph generation.
The initial work presented in this thesis is directed towards tackling two major challenges. The
first is understanding how event mentions are expressed in Hindi. This task is a compelling one
because of an interesting property of Hindi: the syntax of a sentence is influenced by its event semantics.
The observations from our studies are captured and formulated as two documents (annotation guidelines and annotation specifications) that allow accurate identification of event
mentions in Hindi text.
The second challenge we tackle is that of automating the task of event mention detection in Hindi
texts. In order to kickstart exploration of machine learning approaches, we build and
publish the largest event-annotated dataset for Hindi. Our research sheds light on the various possible
methods to automate event detection in Hindi texts and compares the impact of techniques built
with and without hand-crafted features.
Machine learning techniques for automated event detection have advanced considerably owing to the recent
rise in the introduction and adoption of sophisticated neural architectures. However, these architectures are
plagued by biases in opportunistically collected datasets for major languages. As a result,
there has been little to no progress in the development of resources and solutions for low-resource languages. Driven by
this motivation, the final part of this work attempts to address the gap by introducing an architecture, the MultiLingual Sequence Tagger (M-LiST), capable of training on a combination of four monolingual datasets
and attaining state-of-the-art performance for open-domain event detection in three languages and current best performance for one of the languages.