Abstract
The rapid growth of digital libraries, the World Wide Web (WWW), and other text resources has created demand for techniques that work intelligently with limited linguistic resources and meet the growing need for text analysis, pattern discovery, and information management. Graph-based automated text analysis and text mining methods have received a great deal of attention for addressing these needs. An important property of graph-based methods is that they require neither deep linguistic knowledge nor domain- or language-specific annotated corpora, which makes them highly portable across domains, genres, and languages. The development of advanced graph-theoretic techniques for social media mining has also enriched this area. However, developing effective core techniques for mining information from text documents with limited linguistic resources remains a challenging task of significant interest. Many of the core areas already explored still require serious attention and improvement.
Additionally, exploring unexplored or less-explored areas can be very useful and may provide a key ingredient for several text mining applications.
Based on these observations, we have identified several core issues (and techniques for them): (i) meaningful phrase identification, (ii) differentiating the role and sense of words, preferably via a single measure, (iii) handling the information gap at the phrase level with an unsupervised scheme, (iv) integrating the importance of words as a core feature, (v) identifying group semantics and/or logically related features, and (vi) sentence abstraction, among others.
These techniques are useful for multiple text mining applications, such as (a) document summarization, (b) summarization evaluation, (c) document clustering, (d) key-phrase extraction, and (e) automatic question answering. The improvement in the results of our applications over state-of-the-art supervised and unsupervised systems, including those that use linguistic support and domain knowledge, demonstrates the effectiveness of the proposed techniques. Our contributions to the core techniques for domain-independent traditional text mining include:
(i) Phrase identification: Identifying useful phrases is still an open and challenging task, because the available state-of-the-art linguistic and statistical techniques still suffer from many problems. The performance of several text mining tasks depends on the quality of the identified phrases. We introduce a centrality-measure-based technique for phrase identification with an accuracy of more than 90%. We effectively apply this technique to key-phrase extraction (itself treated as a core task for several applications, such as document categorization, indexing, and skimming) and to document clustering.
(ii) Differentiating between the role and sense of words: Words with the same sense often play different roles in a document. Neglecting this information can be misleading, especially when comparing two texts that share matching words, as in summarization evaluation and the evaluation of descriptive answers and essays. To address this, we introduce a graph-based mapping of co-occurring words together with the closeness centrality score.
(iii) Reducing the information gap between documents: We introduce a scheme based on community detection over Wikipedia anchor texts to reduce the information gap between N-grams that are conceptually related but do not match owing to differences in writing scheme or strategy. We successfully applied this scheme to document clustering.
(iv) Integrating the importance of words as a core feature: Most traditional text mining techniques treat words and phrases as points or “raw data” and thus ignore their importance. We believe that treating every word or phrase as a separate data point in language-independent text mining techniques may be inappropriate, and that even well-known data mining techniques may not be directly applicable. This is becau