BigFCM: Fast, Precise and Scalable FCM on Hadoop

Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining pipelines. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology, and networking require high-performance parallel and distributed computation to solve real-world problems. Although some clustering methods have already been adapted to run on big data platforms, their execution time increases sharply for very large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the MapReduce programming model, the proposed algorithm exploits several mechanisms, including an efficient caching design, to achieve a reduction in execution time of several orders of magnitude. The performance of BigFCM is compared with Apache Mahout K-Means and Fuzzy K-Means through an evaluation framework developed in this research. Extensive evaluation using multi-gigabyte datasets, including SUSY and HIGGS, shows that BigFCM is scalable while preserving the quality of clustering.
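As a rough illustration of why FCM fits the MapReduce model, the sketch below runs the textbook FCM center update as a map step over data partitions followed by a reduce step that merges partial sums. It is a generic single-process sketch, not the BigFCM implementation: the function names (`memberships`, `map_partition`, `reduce_partials`, `fcm`) are hypothetical, and the caching and other mechanisms described in the paper are not modeled.

```python
import math

def memberships(x, centers, m=2.0):
    """Textbook FCM membership of point x in each cluster (fuzzifier m > 1)."""
    dists = [math.dist(x, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    for j, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if k == j else 0.0 for k in range(len(centers))]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists]

def map_partition(points, centers, m=2.0):
    """Map step: per-cluster weighted sums computed on one data partition."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for x in points:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * x[d]
    return num, den

def reduce_partials(partials, old_centers):
    """Reduce step: add the partial sums and divide to obtain new centers."""
    k, dim = len(old_centers), len(old_centers[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for p_num, p_den in partials:
        for j in range(k):
            den[j] += p_den[j]
            for d in range(dim):
                num[j][d] += p_num[j][d]
    # Keep the old center if a cluster received no weight at all.
    return [
        tuple(num[j][d] / den[j] for d in range(dim)) if den[j] > 0
        else tuple(old_centers[j])
        for j in range(k)
    ]

def fcm(partitions, centers, m=2.0, iterations=20):
    """Fixed number of map/reduce rounds (a real job would test convergence)."""
    for _ in range(iterations):
        partials = [map_partition(p, centers, m) for p in partitions]
        centers = reduce_partials(partials, centers)
    return centers
```

Because each partition only emits one weighted sum and one weight total per cluster, the data shuffled to the reducer stays small regardless of partition size.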

You can use the source code for research and non-commercial applications. Please cite:

Nasser Ghadiri, Meysam Ghaffari, Mohammad Amin Nikbakht, BigFCM: Fast, precise and scalable FCM on Hadoop, Future Generation Computer Systems, Volume 77, 2017, Pages 29-39, ISSN 0167-739X, https://doi.org/10.1016/j.future.2017.06.010.
(http://www.sciencedirect.com/science/article/pii/S0167739X17312359)
Keywords: MapReduce algorithms; Unsupervised learning and clustering; Data mining; Clustering; Vagueness and fuzzy logic; Big data

 

(For commercial applications please contact nghadiri AT cc.iut.ac.ir)

Heter-LP

Heter-LP is an algorithm for label propagation in heterogeneous networks.

Biomedical text summarization: Itemset-based summarizer

Objective

Automatic text summarization tools can help users in the biomedical domain to access information efficiently from a large volume of scientific literature and other sources of text documents. In this paper, we propose a summarization method that combines itemset mining and domain knowledge to construct a concept-based model and to extract the main subtopics from an input document. Our summarizer quantifies the informativeness of each sentence using the support values of itemsets appearing in the sentence.

Methods

To address the concept-level analysis of text, our method initially maps the original document to biomedical concepts using the Unified Medical Language System (UMLS). Then, it discovers the essential subtopics of the text using a data mining technique, namely itemset mining, and constructs the summarization model. The employed itemset mining algorithm extracts a set of frequent itemsets containing correlated and recurrent concepts of the input document. The summarizer selects the most related and informative sentences and generates the final summary.
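The scoring idea described above can be sketched in a few lines. This is a minimal illustration under several assumptions, not the published summarizer: sentences are assumed to be already mapped to sets of concept identifiers (the UMLS mapping step is not shown), itemsets are mined by brute force up to a small size, and a sentence's score is taken to be the plain sum of the supports of the frequent itemsets it contains. The function names and the `max_size` parameter are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(sentences, min_support, max_size=2):
    """Return {itemset: support} for concept sets of up to max_size items.

    Support is the fraction of sentences containing every concept
    in the itemset.
    """
    n = len(sentences)
    counts = {}
    for concepts in sentences:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(concepts), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {k: c / n for k, c in counts.items() if c / n >= min_support}

def score_sentences(sentences, itemsets):
    """Score each sentence by the summed support of the itemsets it covers."""
    return [
        sum(support for items, support in itemsets.items() if items <= concepts)
        for concepts in sentences
    ]

def summarize(sentences, min_support=0.5, k=1):
    """Return the indices of the k highest-scoring sentences, in document order."""
    itemsets = frequent_itemsets(sentences, min_support)
    scores = score_sentences(sentences, itemsets)
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return sorted(ranked[:k])
```

Raising `min_support` shrinks the set of itemsets that contribute to the score, while lowering it lets rare concept combinations through, which is one way to see why extreme thresholds could hurt summary quality.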

Results

We evaluate the performance of our itemset-based summarizer using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics, performing a set of experiments. We compare the proposed method with GraphSum, TexLexAn, SweSum, SUMMA, AutoSummarize, the term-based version of the itemset-based summarizer, and two baselines. The results show that the itemset-based summarizer performs better than the compared methods. The itemset-based summarizer achieves the best scores for all the assessed ROUGE metrics (R-1: 0.7583, R-2: 0.3381, R-W-1.2: 0.0934, and R-SU4: 0.3889). We also perform a set of preliminary experiments to specify the best value for the minimum support threshold used in the itemset mining algorithm. The results demonstrate that the value of this threshold directly affects the accuracy of the summarization model, such that a significant decrease can be observed in the performance of summarization due to assigning extreme thresholds.

Conclusion

Compared to the statistical, similarity, and word frequency methods, the proposed method demonstrates that the summarization model obtained from the concept extraction and itemset mining provides the summarizer with an effective metric for measuring the informative content of sentences. This can lead to an improvement in the performance of biomedical literature summarization.

 

The source code is uploaded here.

Android app: monitoring the growth of children

An Android application for monitoring the growth of children based on standard charts, developed under the supervision of pediatric specialists.

ADQUEX federated query processing

ADQUEX is a SPARQL federation engine for linked data that executes queries effectively without any need for prior statistical information. It can change the query execution plan at runtime so that fewer intermediate results are produced, and it can also adapt the execution plan to new conditions if unpredicted network latencies arise. The Java source code is available upon request.

FuzzyRCC for PostGIS

Analyzing huge amounts of spatial data plays an important role in many emerging analysis and decision-making domains such as healthcare, urban planning, and agriculture. To extract meaningful knowledge from geographical data, the relationships between spatial data objects need to be analyzed. An important class of such relationships is topological relations, such as the connectedness or overlap between regions. While real-world geographical regions such as lakes or forests do not have exact boundaries and are fuzzy, most existing analysis methods neglect this inherent feature of topological relations. In this paper, we propose a method for handling topological relations in spatial databases based on the fuzzy region connection calculus (RCC). This project is an implementation of fuzzy RCC in the PostGIS spatial database. The PostgreSQL source code is available upon request.

AD-FCM

The accuracy of basic Fuzzy C-Means clustering is subject to false detections caused by noisy records, weak feature selection, and low certainty of the algorithm. AD-FCM is a novel algorithm that detects such ambiguous records in FCM by introducing a certainty factor to reduce invalid detections. It allows the detected ambiguous records to be sent to another discrimination method for deeper investigation, thus increasing accuracy by lowering the error rate. Most records are still processed quickly and with a low error rate, which prevents the performance loss seen in similar hybrid methods. The MATLAB code is available here.
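The routing step described above might look like the following sketch. The definition of certainty used here, the gap between a record's two largest FCM memberships, is an assumption chosen for illustration and is not necessarily the certainty factor defined by AD-FCM; the function names and the default threshold are likewise hypothetical.

```python
def certainty(memberships):
    """Assumed certainty measure: gap between the two largest memberships."""
    top = sorted(memberships, reverse=True)
    return top[0] - top[1] if len(top) > 1 else 1.0

def split_by_certainty(membership_rows, threshold=0.3):
    """Split record indices into confident and ambiguous groups.

    Confident records keep their FCM assignment; ambiguous ones would be
    handed to a second, more expensive discrimination method.
    """
    confident, ambiguous = [], []
    for i, row in enumerate(membership_rows):
        if certainty(row) >= threshold:
            confident.append(i)
        else:
            ambiguous.append(i)
    return confident, ambiguous
```

Only the ambiguous group incurs the cost of the secondary method, which is how a hybrid of this kind can keep most of the speed of plain FCM.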

GGeo Spatial Data Mining

An evolutionary approach to spatial data mining based on the MOSES engine from the OpenCog framework. It also provides a plugin for GeoKettle to perform spatial ETL operations using fuzzy region connection calculus. The source code is available from GitHub here.

Peer-to-peer Privacy Preserving Query Service

A novel query-based approach to protecting the privacy of users in location-based social networks. This approach aims to address privacy concerns with minimal loss of quality of service and no security bottleneck. The Java source code is uploaded here.