Abstract
The role of the Internet in personal, economic, and political advancement is growing at a fast pace. By the turn of the century, data on the web will reach petabytes or exabytes, and may scale up to even larger quantities. Extracting precise, structured information from such large amounts of unstructured or semi-structured data is a major concern of the web, known as Information Extraction.
Named entity recognition (NER), also known as entity identification or entity extraction, is an important subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as names of persons, organizations, locations, monetary values, percentages, expressions of time, etc. NER has many applications in NLP, e.g., data classification, question answering, cross-language information access, machine translation, and query processing.
Recognizing Named Entities (NEs) in English has reached accuracies nearing 98%. In English, many cues aid in understanding the structure of the language (one important cue for identifying NEs is capitalization), which has helped drive accuracies so high. In Indian languages, no such cues are available, and moreover each Indian language differs from the others in grammatical structure. Hence, developing a language-independent NER system is a challenging task.
Previous work includes developing NER systems using language-dependent tools such as POS taggers, dictionaries, chunk taggers, gazetteer lists, etc.; relying on linguistic experts to manually tag the training and testing data; or having linguistic experts generate rules for recognizing NEs. Language-independent approaches include supervised machine learning techniques such as CRF, HMM, MEMM, SVM, etc. These techniques need large amounts of manually tagged data, which is again a point of concern. Other approaches exploit external knowledge such as Wikipedia, but those methods do not utilize Wikipedia fully. Hence, the main objective of this work is to build a language-independent NER system without any manual intervention and without any use of language-dependent tools.
The approach specified throughout this work includes language-independent methods to identify, extract, and recognize NEs. Identification of NEs is done using an external knowledge source, namely Wikipedia. More specifically, English Wikipedia is used as an aid to derive the NEs of Indian languages. The hierarchical structure of Wikipedia is explored and the documents in it are divided into specific domains. For each domain, the corresponding English and Indian-language documents are clustered. The English documents are tagged using the Stanford NER Tagger and the non-NEs are removed. Using the term co-occurrences between the tagged English words and the untagged Indian-language words, the corresponding NEs in the Indian language and English are mapped, and the tag of each English NE is copied to the matching Indian-language NE. Hence, the Indian-language data is tagged.
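The co-occurrence-based tag projection described above can be sketched as follows. This is a minimal illustration, not the thesis's exact procedure: the function name, the representation of clustered document pairs, and the `min_count` cutoff are all assumptions, and the real system works over domain-level clusters rather than simple document pairs.

```python
from collections import defaultdict

def map_tags_by_cooccurrence(doc_pairs, min_count=2):
    """doc_pairs: (english_doc, indian_doc) pairs from the same cluster.
    english_doc is a list of (word, tag) with non-NEs already removed;
    indian_doc is a plain list of untagged Indian-language words."""
    cooc = defaultdict(lambda: defaultdict(int))  # indian word -> english NE -> count
    tag_of = {}
    for en_doc, in_doc in doc_pairs:
        for en_word, tag in en_doc:
            tag_of[en_word] = tag
            for in_word in set(in_doc):
                cooc[in_word][en_word] += 1
    mapping = {}
    for in_word, counts in cooc.items():
        # Map each Indian word to the English NE it co-occurs with most often.
        en_word, count = max(counts.items(), key=lambda kv: kv[1])
        if count >= min_count:
            mapping[in_word] = tag_of[en_word]  # copy the English NE's tag
    return mapping
```

Frequent co-occurrence across many document pairs is the signal; words that appear alongside an English NE only incidentally fall below the cutoff and stay untagged.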
The tagged data generated in the previous step is used to recognize NEs in sets of monolingual Indian-language documents. In this step, a set of features is generated from the words of these documents, and these features are used to recognize NEs in a new document. For each document, the tagged words are first extracted using the data from the previous step. From the remaining words of the document, a Naive Bayes classifier is built that generates a set of features for each class (the features here are the important words of a particular class in that document). The importance of these features is calculated statistically using different metrics (the classification metrics). Given a new document, the presence of these features is checked and their scores are computed; if the score exceeds a threshold, it implies the presence of NEs in the document. The process is repeated on progressively smaller portions of the document until the NE is obtained. Hence, the monolingual Indian-language document is tagged.
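The per-class feature scoring and threshold test described above can be sketched as follows. This is an illustrative simplification under stated assumptions: the function names are hypothetical, relative word frequency stands in for the thesis's classification metrics, and the threshold value is arbitrary.

```python
from collections import defaultdict

def train_feature_scores(tagged_docs):
    """tagged_docs: (words, ne_class) pairs, where words are the context
    words associated with NEs of that class. Scores each feature word by
    its relative frequency within the class (a stand-in for the thesis's
    statistical importance metrics)."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, ne_class in tagged_docs:
        for w in words:
            counts[ne_class][w] += 1
    scores = {}
    for ne_class, wc in counts.items():
        total = sum(wc.values())
        scores[ne_class] = {w: c / total for w, c in wc.items()}
    return scores

def classes_present(words, scores, threshold=0.5):
    """Sum the scores of features found in the document; classes whose
    total exceeds the threshold are taken to contain an NE."""
    hits = []
    for ne_class, feature_scores in scores.items():
        total = sum(feature_scores.get(w, 0.0) for w in words)
        if total > threshold:
            hits.append(ne_class)
    return hits
```

In the full procedure, a document that passes the threshold would be split into smaller spans and re-scored, narrowing down until the NE itself is located.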
The approach specified for identifying and recognizing NEs is language independent and can be extended to any language, as no language-dependent tools are used and no linguistic experts are involved. Hindi, Marathi, and Telugu are the languages on which the work has been done. PERSON, LOCATION, and ORGANIZATION are the NE tags used throughout the identification and recognition process.
Wikipedia is used as the dataset for identifying NEs. Around 305,574 English, 100,000 Hindi, 83,000 Marathi, and 85,000 Telugu documents are used to generate the results. The results are evaluated on manually tagged sets of 2,328 Hindi, 1,658 Marathi, and 2,200 Telugu Wikipedia documents. The F-measure scores are 80.42 for Hindi, 81.25 for Marathi, and 79.98 for Telugu.
Dataset for recognition of NEs is a