As the size of networks increases, it is becoming important to analyze large-scale network
data. Network clustering algorithms are useful for analyzing such data. Most research on network
clustering has targeted single-machine environments rather than parallel
environments. However, single-machine algorithms cannot analyze
large-scale network data because of memory limitations. As a solution, we propose a network
clustering algorithm for large-scale network data analysis using Apache Spark by
changing the paradigm of the conventional clustering algorithm to improve its efficiency in
the Apache Spark environment. We also apply optimization approaches such as Bloom filter
and shuffle selection to reduce memory usage and execution time. By evaluating the proposed
algorithm with the average normalized cut, we confirm that the algorithm can
analyze diverse large-scale network datasets such as biological, co-authorship, Internet
topology and social networks. Experimental results show that the proposed algorithm
finds more accurate clusters than the compared algorithms while using less memory. Furthermore,
we confirm the effectiveness of the proposed optimization approaches and the scalability of the proposed
algorithm. In addition, we validate that clusters found by the proposed algorithm
can represent biologically meaningful functions.
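
The evaluation metric named above can be illustrated with a minimal sketch. It assumes the common definition of normalized cut for a cluster C, cut(C) / vol(C), where cut(C) counts edges leaving C and vol(C) is the sum of degrees of nodes in C, averaged over all clusters; the paper's exact formulation may differ. The function name `average_normalized_cut` and its input layout are hypothetical and chosen only for illustration. Lower values indicate better-separated clusters.

```python
from collections import defaultdict


def average_normalized_cut(edges, membership):
    """Average of cut(C) / vol(C) over all clusters (assumed definition).

    edges: iterable of (u, v) pairs for an undirected, unweighted graph.
    membership: dict mapping each node to its cluster id.
    """
    cut = defaultdict(int)  # edges crossing the boundary of each cluster
    vol = defaultdict(int)  # sum of node degrees inside each cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        # Each edge adds one to the degree of both endpoints.
        vol[cu] += 1
        vol[cv] += 1
        if cu != cv:
            cut[cu] += 1
            cut[cv] += 1
    # Skip clusters with no incident edges to avoid division by zero.
    scores = [cut[c] / vol[c] for c in set(membership.values()) if vol[c] > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
score = average_normalized_cut(edges, membership)
```

In this toy graph each cluster has one boundary edge and a volume of 7, so both clusters score 1/7 and the average is 1/7.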
dc.language
eng
dc.relation.ispartofseries
PLOS ONE
dc.title
CASS: A distributed network clustering algorithm based on structure similarity for large-scale network