Abstract
A large portion of the data that resides in enterprises exists in the form of unstructured text. Unstructured data here means collections of natural language text such as Word documents, HTML pages and plain text files. Structured data, on the other hand, refers to relational databases, XML documents and similar sources, where a well-articulated scheme governs how the data is represented and stored. Unstructured data is easy for humans to produce and comprehend, whereas extracting information from it is not trivial for machines and software applications. Because unstructured data lacks structure and regular patterns, full-text search has to be carried out in order to target the specific portions that hold a particular piece of information.
Between the Internet and personal machines lie the intranets and data archives of various enterprises, which have their own set of guidelines for search. A typical enterprise search system would benefit from the thoroughness of desktop search, but cannot afford the inherent slowness of the on-the-fly linear scans that desktop search typically uses. An Internet search approach, on the other hand, offers speedy indexing and retrieval but may be too generic to produce precise results. The distinctive composition of enterprise data makes enterprise search an entirely different beast.
This thesis concentrates on enhancing the relevance of results in query-based document retrieval in Enterprise Search (ES). Approaches to implementing effective ES differ from those used for standard document retrieval and web search, and a number of past and ongoing works in the ES research area are discussed in this thesis. Our approaches address two important open research areas in ES: user context aware ES, and resolving the issues of vocabulary bias and narrowness. In user context aware ES, we concentrate on a computerized work environment that is aware of the user's context, so that ES systems can better understand the user's information requirements. Issues of vocabulary bias and narrowness arise when the vocabulary used in queries differs from that used in the content, even though the meaning may be the same. In any form of document retrieval, the result-set depends strongly on the query. Because a query represents the information need of a user, a misrepresentative or ambiguously formed query dilutes the quality of the results. To resolve this issue, we try to reduce the difference between the vocabulary used in constructing the query and that used in enterprise content.
We enhance ES by concentrating on two basic aspects of information retrieval: query expansion and re-ranking. User and work context aware ES is implemented through user role based query expansion, using local and global analysis techniques, and through re-ranking of the result-set by role based document classification. External lexicon and thesaurus repositories constructed from open source on-line encyclopedias such as Wikipedia serve as complementary knowledge bases for query expansion, countering the vocabulary bias issues in the enterprise. This is accomplished by extracting a Wikipedia concept thesaurus from the article link structure of Wikipedia and using it for query expansion. The re-ranking step uses large manually tagged sets that are rich in categories from all domains of Wikipedia. These tagged sets are used to train a classifier, which assigns search results to various enterprise document classes. Documents belonging to the dominant classes in the result-set are given higher preference in re-ranking. We also introduce a broader vocabulary range into the result-set through collection enrichment techniques that use Wikipedia article text for pseudo relevance feedback. Both supervised and unsupervised classification techniques are used to identify good feedback documents in the pseudo-relevant set.
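As an illustration only, the idea of preferring documents from the dominant classes of a result-set can be sketched as follows. This is a minimal sketch under stated assumptions, not the scoring used in the thesis: the function name `rerank_by_dominant_class`, the `top_classes` and `boost` parameters, and the multiplicative boost are all hypothetical, and each result is assumed to already carry a retrieval score and a class label predicted by some classifier.

```python
from collections import Counter


def rerank_by_dominant_class(results, top_classes=1, boost=1.5):
    """Re-rank (doc_id, score, class_label) tuples.

    Hypothetical scheme: the `top_classes` most frequent class labels in
    the result-set are treated as dominant, and the scores of documents in
    those classes are multiplied by `boost` before sorting.
    """
    # Count how often each predicted class occurs in the result-set.
    counts = Counter(label for _, _, label in results)
    dominant = {label for label, _ in counts.most_common(top_classes)}

    # Boost documents in dominant classes; leave the others unchanged.
    rescored = [
        (doc_id, score * boost if label in dominant else score, label)
        for doc_id, score, label in results
    ]
    # Highest adjusted score first; sorted() is stable, so ties keep
    # their original retrieval order.
    return sorted(rescored, key=lambda r: r[1], reverse=True)


# Toy result-set with made-up scores and class labels.
results = [
    ("d1", 0.9, "hr"),
    ("d2", 0.8, "it"),
    ("d3", 0.7, "it"),
    ("d4", 0.5, "it"),
]
ranked_ids = [doc_id for doc_id, _, _ in rerank_by_dominant_class(results)]
print(ranked_ids)  # ['d2', 'd3', 'd1', 'd4']
```

In this toy example the class "it" dominates the result-set, so its two highest-scoring documents move ahead of the originally top-ranked "hr" document, while the weakest "it" document still ranks last.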
We evaluate our approaches on two datasets, namely the IIIT-H corpus and the CERC dataset. Role based personalization in ES required a dataset with role based topics and relevance judgments. As no such dataset existed, we customized the IIIT-H intranet data for this purpose. The rest of the techniques were evaluated on a standard ES evaluation platform, the CERC dataset provided by the TREC Enterprise track. Role based personalization shows improvements over plain text retrieval, both for local and global analysis based query expansion and for re-ranking. Wikipedia concept thesaurus based query expansion yields gains in recall without diluting the precision of the search results. The Wikipedia category network based re-ranking method shows moderate improvements in precision. Improvements are also observed for Wikipedia article text based pseudo relevance feedback in comparison to blind relevance feedback without enriching the pseudo relevant