Abstract
Every data set has a story to tell and a unique set of characteristics that depend on factors such as
(a) the domain of the data, (b) external factors forcing the data to follow some pattern, (c) the data
instances, (d) the dimensions and their data types, (e) interactions between the instances and the dimensions, and (f) most importantly, the underlying patterns in the data set. Interactions are the relationships that the instances and dimensions share: for instance, similarities, dissimilarities, nearest neighbor relations, or any other semantic relation defined over data instances and dimensions.
Knowledge discovery techniques aim to obtain these characteristics of the data, or of parts of it,
by analyzing the data and identifying common patterns (such as classes, clusters and frequent itemsets) and
anomalous patterns (i.e., outliers). With very large data sets, the summarized characteristics can
themselves be large, whereas a summary should ideally be small. Effective presentation of the
summary of the data is therefore of great utility to the end-user.
The aim of this dissertation is to capture the characteristics of the data, with the underlying patterns expressed as clusters and outliers, and then to present a visual information thumbnail of these characteristics so that the user can analyze and understand the data and its patterns.
The underlying patterns are captured with data clustering and outlier detection algorithms
based on the concept of reverse nearest neighbors. This dissertation elaborates on the properties and
behavior of reverse nearest neighbor sets that are useful for detecting outliers and identifying
clusters. We propose a suite of techniques to: (i) identify candidate outliers, (ii) identify clusters, (iii)
identify local outliers, (iv) rank outliers, (v) obtain an agglomerative clustering and finally (vi) obtain a
stable clustering result. We also propose two cluster validity measures based on proximity graphs to
evaluate the “goodness” of the clustering results.
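As an illustration only, the notion of a reverse nearest neighbor set can be sketched as follows. The sketch assumes the common textbook definition, in which the reverse k-nearest neighbors of a point p are all points that count p among their own k nearest neighbors; the exact formulation used in the dissertation may differ. The function names, the value of k, the toy data, and the rule of flagging points with an empty reverse nearest neighbor set as candidate outliers are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_nearest_neighbors(points, k):
    """For each point index, return the indices of its k nearest other points."""
    knn = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        knn.append(others[:k])
    return knn


def reverse_nearest_neighbors(points, k):
    """For each point index p, return the set of indices q with p in kNN(q)."""
    knn = k_nearest_neighbors(points, k)
    rnn = [set() for _ in points]
    for q, neighbors in enumerate(knn):
        for p in neighbors:
            rnn[p].add(q)
    return rnn


# Toy data (hypothetical): four points on a unit square and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 11.0)]
rnn = reverse_nearest_neighbors(points, k=2)

# Assumed heuristic: a point that no other point selects as a neighbor
# is treated as a candidate outlier.
candidate_outliers = [i for i, s in enumerate(rnn) if not s]
print(candidate_outliers)
```

In this toy example the distant point is selected by no other point as a neighbor, so its reverse nearest neighbor set is empty, while the four clustered points all appear in one another's neighbor lists.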
The visual information thumbnail is generated from the results of the clustering algorithm
and from the subspace data analysis performed using nearest neighbors. The information thumbnail incorporates
various levels of detail, displayed on demand to the user.