Abstract
Processing a sentence for any practical application requires a complete understanding of its syntactic and semantic structure. Natural language parsers are used to perform a formal analysis of a sentence into its constituents (words or phrases), resulting in a parse tree that shows their syntactic relations to each other: for example, which groups of words go together as phrases and which words are the subject or object of a verb. Natural language parsers are generally of two types:
rule-based parsers, which construct a parser given a grammar, and data-driven parsers, which construct a
parser given a treebank (sentences annotated with complete syntactic information in the form
of tags and dependency relations). The accuracy of these parsers, both rule-based and data-driven, often
depends on the complexity of the sentence being parsed. This complexity arises from ambiguous
structures or words, out-of-vocabulary words, and loosely annotated syntactic information such as
part-of-speech tags and syntactic relations between phrases or words. Data-driven parsers
use knowledge of language learned from the sentences they are trained on to produce the most likely
analysis of new sentences; it is therefore very important for a parser to extract sufficient and accurate information from a sentence before parsing it.
Consider the example below, where all the syntactic information (subject, object, verb, etc.) is
present in the sentence, making it easy for any application to extract complete information.
‘John and Mary went to hospital’
Now, consider the following example:
‘John can sing a song, and Mary can too’
Sentences like these appear very commonly in treebanks, as treebanks are created from
a wide range of sources such as articles, books, and discourses. As mentioned earlier, to accurately predict the
syntactic structure of such sentences, a parser needs the syntactic information annotated in them. If we
look at the example above, the complete syntactic information expected by the parser is
‘John can sing a song, and Mary can (sing) too’
In the sentence ‘John can sing a song, and Mary can too’, whether or not the verb ‘sing’ is repeated
is up to the speaker and to communicative aspects of the situational context in which the sentence is uttered. For the parser, however, the verb ‘sing’, along with information such as its POS tag and its relations to other elements in the sentence, is very important to have in the sentence in order to parse it properly. The missing elements create a problem not only for parsing the sentence but for any application that requires syntactic understanding of the sentence, for instance information extraction, question answering and related semantic tasks, machine translation, part-of-speech tagging, morphology, etc. The more precise and detailed our predicate-argument structures are (including empty categories), the more complete our event descriptions will be, and therefore the more effective our semantic processing techniques will be.
We call these missing words/phrases ‘empty categories’ or ‘null elements’.
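To make the notion concrete, the sketch below shows one possible way of representing the elided verb as an explicit NULL node in a CoNLL-style dependency encoding of the example above; the POS tags, dependency labels and the NULL convention used here are illustrative assumptions rather than the actual annotation scheme of any particular treebank.

```python
# Illustrative only: a CoNLL-style token list for
# "John can sing a song, and Mary can (sing) too",
# with the elided verb restored as an explicit NULL node.
# The POS tags, dependency labels and the NULL convention are
# simplifying assumptions, not the actual treebank scheme.

tokens = [
    # (id, form, pos, head, deprel)
    (1,  "John", "NNP", 3,  "nsubj"),
    (2,  "can",  "MD",  3,  "aux"),
    (3,  "sing", "VB",  0,  "root"),
    (4,  "a",    "DT",  5,  "det"),
    (5,  "song", "NN",  3,  "obj"),
    (6,  ",",    ",",   3,  "punct"),
    (7,  "and",  "CC",  10, "cc"),
    (8,  "Mary", "NNP", 10, "nsubj"),
    (9,  "can",  "MD",  10, "aux"),
    (10, "NULL", "VB",  3,  "conj"),   # empty category: elided "sing"
    (11, "too",  "RB",  10, "advmod"),
]

for tid, form, pos, head, deprel in tokens:
    print(tid, form, pos, head, deprel, sep="\t")
```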
In this work, empty categories are retrieved in two ways:
1. Retrieve empty categories in the sentence itself
2. Retrieve empty categories in parse trees while parsing
Insertion into the sentence is handled statistically with the help of machine learning algorithms, and
this method is leveraged as a post-processing step, i.e. empty categories are inserted into the parser output.
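The sketch below illustrates this post-processing view on a toy scale: each inter-word slot of the parser output is handed to a classifier that decides whether an empty category of some type should be inserted there. The feature set, the label ‘NULL_VB’ and the stand-in rule used as the classifier are assumptions for illustration, not the statistical model actually used in this work.

```python
# A minimal sketch of empty-category insertion as post-processing,
# assuming the parser output is available as a flat word sequence.
# `classify_slot` stands in for the trained statistical model; the
# features, labels and the toy rule below are illustrative only.

def insert_empty_categories(words, classify_slot):
    """Return a new word list with predicted empty categories inserted."""
    out = []
    for i, word in enumerate(words):
        # Features describing the slot *before* word i.
        features = {
            "prev": words[i - 1] if i > 0 else "<S>",
            "next": word,
        }
        label = classify_slot(features)   # e.g. "NULL_VB" or None
        if label is not None:
            out.append(label)
        out.append(word)
    return out

# Toy stand-in for the learned model: posit an elided verb between
# the auxiliary "can" and a continuation that is clearly not a verb.
def toy_classifier(features):
    if features["prev"] == "can" and features["next"] in {"too", "."}:
        return "NULL_VB"
    return None

sentence = "John can sing a song , and Mary can too".split()
print(insert_empty_categories(sentence, toy_classifier))
# ['John', 'can', 'sing', ..., 'and', 'Mary', 'can', 'NULL_VB', 'too']
```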
Next, we show how to detect and retrieve empty categories within a data-driven parser itself. We then
move on to handling empty categories in the Hindi dependency treebank by analysing and classifying them,
and identify various discovery procedures to automatically detect the existence of these categories in a sentence. For this we make use of lexical knowledge along with the parsed output from
a constraint-based parser. Through this work we show that it is possible to successfully discover certain
types of empty categories, while some other types are more difficult to identify. This work leads to a
state-of-the-art system for the automatic insertion of empty categories into Hindi input sentences and a
generic approach to handling empty categories in dependency treebanks of any language.