Abstract
Over the last two decades, the spread of the Internet has drawn contributors from all around the world. This has led to an expansion of information across many different languages. Tools have been developed to handle content in many languages independently. However, when a user wants search results that draw on content from several languages at once, the existing tools do not suffice. Studies reveal that the number of users requesting multilingual information in their search results is on the rise.
Information Retrieval (IR) is a field of computer science that deals with information available in huge volumes, such as on the Internet, and with retrieving and displaying the information that satisfies a user's need. This concept can be extended to large volumes of multilingual information. When a user expresses an information need over such a multilingual collection, retrieving the exact information is very challenging. The task involves several inherent issues: structuring the content, understanding the user's information need, fetching relevant documents, retrieving similar documents from other languages, and finally ranking all documents by their relevance to the query, irrespective of their language. One possible solution is to retrieve a result list separately in each language requested by the user and then merge these lists to produce a single multilingual result list. This thesis attempts to solve the problem of merging the intermediate result lists by ranking all documents according to their relevance to the query, irrespective of language.
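The merging step described above can be pictured with a minimal sketch. It assumes each language has already produced its own result list and that some function assigns a comparable relevance score to every document; the names `merge_result_lists`, `relevance`, and the toy scores are hypothetical and serve only as illustration.

```python
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, str]  # (language code, document id)


def merge_result_lists(
    per_language: Dict[str, List[str]],
    relevance: Callable[[str, str], float],
) -> List[Doc]:
    """Pool monolingual result lists and rank the pool by relevance."""
    # Pool every document, remembering which language it came from.
    pooled = [
        (lang, doc_id)
        for lang, doc_ids in per_language.items()
        for doc_id in doc_ids
    ]
    # Rank the pooled documents, highest relevance first, regardless of language.
    return sorted(pooled, key=lambda doc: relevance(*doc), reverse=True)


# Toy relevance scores (made up for illustration only).
toy_scores = {("en", "e1"): 0.9, ("en", "e2"): 0.4, ("hi", "h1"): 0.7}

merged = merge_result_lists(
    {"en": ["e1", "e2"], "hi": ["h1"]},
    lambda lang, doc_id: toy_scores[(lang, doc_id)],
)
```

The hard part of the problem is the `relevance` function itself: scores produced independently per language are generally not comparable, which is why a learned, language-independent scoring model is needed.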
The task is challenging because merging result lists requires understanding the content of all documents across languages and ranking them according to their relevance to the query. Several existing approaches make extensive use of language resources and tools to achieve this. However, such approaches do not carry over well across languages because of their dependence on those resources and tools. Many languages, such as Hindi, suffer from limited availability of language resources and tools. Developing a language-independent approach is therefore crucial for such languages. The central theme of this thesis is to develop an approach that is both language-independent and efficient for merging several monolingual result lists into a multilingual result list.
The proposed solution to this problem is a learning-to-rank framework. In this framework, each query-document pair is given a relevance judgment indicating whether the document is relevant to the query. Features are extracted from the content of the query and the document. Ranking algorithms are then used to train a model that assigns a relevance score to a new, unseen query-document pair. These scores are used to rank the multilingual documents.
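As a rough illustration of this framework, the sketch below trains a pointwise model on judged query-document pairs and uses it to rank unseen pairs. It is a sketch under stated assumptions: logistic regression stands in for whichever ranking algorithm is actually used, and the two features and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (feature weights, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(model: Model, x: Sequence[float]) -> float:
    """Relevance score in (0, 1) for one query-document feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_pointwise(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 500,
    lr: float = 0.5,
) -> Model:
    """Fit logistic regression by stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict((weights, bias), x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical features per query-document pair; label 1 = judged relevant.
train_x = [[0.9, 0.8], [0.8, 0.6], [0.2, 0.1], [0.1, 0.3]]
train_y = [1, 1, 0, 0]
model = train_pointwise(train_x, train_y)

# Rank unseen documents from different languages by predicted relevance.
candidates = {"hi_doc": [0.85, 0.7], "en_doc": [0.15, 0.2]}
ranked = sorted(candidates, key=lambda d: predict(model, candidates[d]), reverse=True)
```

Because the features are plain numbers computed from the query and document content, the same trained model can score documents in any language, which is what allows a single ranking over the merged list.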
In this thesis, we propose simple and efficient features that enhance ranking performance. All calculations are performed without using any language tools. To keep the approach language-independent, several alternative methods are introduced to replace existing language tools such as lemmatizers.
One of the proposed features involves computing the similarity between two documents in different languages. Comparing multilingual documents is not a trivial task. Bilingual dictionaries are usually used to translate a document before comparison. In this thesis, a new dimension of document representation is explored. Instead of representing documents with words, we attempt to represent documents as a mixture of the topics/themes inherent in the document. Latent Dirichlet Allocation (LDA), a topic modeling algorithm, is used to achieve this. However, to represent multilingual documents as a mixture of topics, a new mathematical formulation is derived from the existing LDA model, which we call MultiLDA. The MultiLDA model has the potential to be applied elsewhere. Various experiments have been performed to validate each concept introduced in this thesis. The results reveal the significance of the proposed solutions.
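To make the topic-mixture idea concrete, the sketch below compares documents once they are represented as distributions over a shared set of topics. It assumes the topic proportions are already available for documents in both languages; how MultiLDA derives them is not shown, cosine similarity is only one possible choice of measure, and all values are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic-proportion vectors."""
    if len(p) != len(q):
        raise ValueError("both documents must use the same number of topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Hypothetical topic mixtures over three shared topics.
english_doc = [0.70, 0.20, 0.10]
hindi_doc_same_theme = [0.65, 0.25, 0.10]
hindi_doc_other_theme = [0.05, 0.15, 0.80]

sim_same = cosine_similarity(english_doc, hindi_doc_same_theme)
sim_other = cosine_similarity(english_doc, hindi_doc_other_theme)
```

Since the comparison works on topic proportions and not on words, no bilingual dictionary or translation step is needed at comparison time; the resulting similarity can be used as one feature in the ranking model.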