Abstract
In the big data era, pattern mining is an important task of data mining and involves the extraction
of interesting associations from large databases. In the literature, pattern mining approaches have been
investigated to extract different kinds of patterns such as frequent patterns, sequential patterns, periodic patterns, utility patterns, and coverage patterns from transactional databases. The pattern mining
approaches have been widely employed to improve the performance of applications in the areas of recommendation systems, e-commerce, customer relation management, internet advertising, decision support systems, social media, and so on. One of the key issues of developing pattern mining algorithms
is as follows: a pattern mining algorithm has to process a huge volume of the transactional dataset.
Research efforts are going on to develop pattern mining algorithms to operate on large datasets by exploiting different parallel processing paradigms, namely shared memory, shared disk, shared-nothing,
GPUs, incremental, MapReduce, and Grid. In this thesis, we propose both MapReduce based and Incremental techniques to extract the knowledge of patterns called Coverage Patterns.
The framework of MapReduce has become very popular to process big data. It enables the distributed
processing of huge amounts of data on hundreds of machines in geographically distributed environments
with robust fault-tolerance. With Incremental mining framework, when the increment database is added,
it is possible to improve the performance by limiting the computation to incremental part and with the
required additional processing. Given a transactional database, the coverage pattern mining involves
the extraction of multiple sets of items, where each set (covers a certain percentage of transactions)
satisfies relative frequency, coverage support and the overlap ratio constraints. The applicability of
coverage patterns has been demonstrated in improving the performance of search engine advertising
and banner advertising. In the literature, a level-wise iterative coverage pattern mining (CMine) and a
pattern growth approaches have been proposed.
As a first contribution, we have proposed a MapReduce based coverage pattern mining approach.
The key issue is to compute the overlap ratio value of a pattern in an efficient manner under MapReduce. However, we exploit the observation that it is possible to compute the overlap ratio of a given
pattern by computing the numerator and the denominator independently in a distributed manner. Experimental evaluation with both real-world and synthetic datasets demonstrate that the proposed approach
significantly improves the performance over the existing CMine approach.
Furthermore, we have proposed an incremental coverage pattern extraction algorithm, when an increment of the transactional database (simply Increment) is added to the existing transactional database.
It has been observed that, when the Increment is added, if the order of items in the frequency list (Flist)
of the database is changed, additional patterns have to be validated and require additional processing.
By considering the issue of change in the order of items in the Flist, we have proposed an Incremental
framework to extract coverage patterns. We have also proposed a MapReduce based Incremental approach by combining the advantages of Incremental and MapReduce techniques. We have conducted
experiments on both synthetic and real-world datasets and demonstrated that the proposed approaches
could improve the performance significantly even though increment size is equal to the existing transactional database.
Overall, we have proposed MapReduce, Incremental, and MapReduce based Incremental approaches
to extract the knowledge of coverage patterns from large transactional databases. The efficiency of
the proposed algorithms is demonstrated by conducting extensive experiments on both real-world and
synthetic datasets. We hope that the proposed approaches will enable the applications which extract
coverage patterns to process large transactional databases