Abstract
The growth of the Internet, like no other communication medium, has given a "globalized dimension" to the world. It has become the universal source of information for millions of people at home, at school, and at work. Because contributors all over the world publish on the web, the amount of information available is huge and is distributed across many languages. Hence, there is a need to develop applications that can manage such a large amount of varied information.
Multilingual Document Clustering (MDC) is one such technique, and it is highly useful for organizing large amounts of data written in different languages.
One of its main applications is Cross-lingual Information Retrieval (CLIR), where a search engine takes a query in one language and retrieves results in different languages. Instead of presenting the results as a single long list, the search engine can return them as a list of clusters, where each cluster contains similar multilingual documents. This encourages cluster-based browsing, which makes the results easier for users to process.
Existing approaches to multilingual document clustering use various tools and resources, such as Manually Annotated (MA) bilingual dictionaries, lemmatizers, and Named Entity Recognizers (NERs), for processing the data. However, many languages across the world do not have the luxury of such tools and resources. Hence, it is very difficult to process and organize data in languages that lack sufficient resources.
In this thesis, we focus on clustering multilingual documents of resource-poor languages. We propose three main approaches to multilingual document clustering that do not make use of any language-dependent resources or tools.
In the first approach, we cluster multilingual documents using Wikipedia as external knowledge. We use various features of Wikipedia, such as cross-lingual links, categories, outlinks, and Infobox information, to enrich the document representation. This helps in comparing two multilingual documents efficiently and in forming good multilingual clusters. The use of external knowledge improves clustering performance because it takes the semantic association between words into account.
In the second approach, we propose a method to identify the Named Entities (NEs) present in resource-poor languages such as Hindi and Marathi. For this purpose, we use the Named Entities present in English, which is a resource-rich language. The identified Named Entities help in detecting the topic of a given document, which makes the comparison of two documents more efficient and improves the performance of multilingual document clustering.
In our final approach, we propose a method that combines our first and second approaches. Here, Named Entities and Wikipedia features are used together to enrich the document representation and form good multilingual clusters.
In all our approaches, we use the Bisecting k-means algorithm to form multilingual clusters. We do not make use of any non-English linguistic tools or resources such as WordNet or Part-of-Speech taggers, which makes our proposed work easy to extend to other languages. Experiments are conducted on English, Hindi, and Marathi news datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. All our approaches are evaluated using the F-score, Purity, and Normalized Mutual Information (NMI) measures, and the results obtained are encouraging.
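
As an illustration of two of the evaluation measures named above, the following is a minimal Python sketch of Purity and NMI for a flat clustering. It is not the evaluation code used in this thesis. The function names, the toy data, and the choice to normalize mutual information by the arithmetic mean of the two entropies are assumptions made for the example; other normalizations (for instance, the geometric mean) are also in common use.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of cluster or class assignments."""
    n = len(assignments)
    return -sum((k / n) * math.log(k / n) for k in Counter(assignments).values())


def nmi(clusters, labels):
    """Mutual information normalized by the mean of the two entropies (assumed variant)."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    c_counts = Counter(clusters)
    y_counts = Counter(labels)
    mi = sum(
        (k / n) * math.log(k * n / (c_counts[c] * y_counts[y]))
        for (c, y), k in joint.items()
    )
    h_c, h_y = entropy(clusters), entropy(labels)
    if h_c == 0 or h_y == 0:
        # Degenerate case: at least one side puts every document in one group.
        return 1.0 if h_c == h_y else 0.0
    return mi / ((h_c + h_y) / 2)


# Toy example: four documents, two predicted clusters, two gold topics.
predicted = [0, 0, 1, 1]
gold = ["sports", "sports", "politics", "politics"]
print(purity(predicted, gold), nmi(predicted, gold))
```

Both measures compare predicted cluster assignments against gold topic labels only, so they apply unchanged whether the documents in a cluster are in English, Hindi, or Marathi.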