Parallel Nonnegative Matrix Factorization with Manifold Regularization
Hindawi
Journal of Electrical and Computer Engineering
Volume 2018, Article ID 6270816, 10 pages
https://doi.org/10.1155/2018/6270816
Research Article
Fudong Liu, Zheng Shan, and Yihang Chen
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, Henan 450001, China
Correspondence should be addressed to Fudong Liu;
Received 12 November 2017; Revised 7 February 2018; Accepted 15 March 2018; Published 2 May 2018
Academic Editor: Tongliang Liu
Copyright © 2018 Fudong Liu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Nonnegative matrix factorization (NMF) decomposes a high-dimensional nonnegative matrix into the product of two
lower-dimensional nonnegative matrices. However, conventional NMF neither scales to large datasets, because it keeps all data in
memory, nor preserves the geometrical structure of the data, which some practical tasks require. In this paper, we propose a
parallel NMF with manifold regularization method (PNMF-M) that overcomes both deficiencies by parallelizing
manifold regularized NMF on a distributed computing system. In particular, PNMF-M distributes both data samples and factor
matrices to multiple computing nodes, instead of loading the whole dataset on a single node, and updates both factor matrices
locally on each node. In this way, PNMF-M relieves the memory pressure caused by large-scale datasets and
speeds up the computation through parallelization. To construct the adjacency matrix used in manifold regularization, we propose a
two-step distributed graph construction method, which is proved to be equivalent to the batch construction method. Experimental
results on popular text corpora and image datasets demonstrate that PNMF-M significantly improves both the scalability and the time
efficiency of conventional NMF thanks to parallelization on a distributed computing system; meanwhile, it significantly enhances
the representation ability of conventional NMF thanks to the incorporated manifold regularization.
1. Introduction
Data representation is a fundamental problem in data analysis. A good representation typically uncovers the latent structure of a dataset by reducing the dimensionality of the data. Several methods, including principal component analysis (PCA),
linear discriminant analysis (LDA), and vector quantization (VQ), have addressed this issue. More recently, nonnegative
matrix factorization (NMF) [1] incorporates a nonnegativity
constraint to obtain a parts-based representation of data, and
it has therefore been widely used in applications such as
document clustering [2, 3], image recognition [4, 5], audio
processing [6], and video processing [7].
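As a minimal illustration of the factorization itself, the sketch below applies the well-known multiplicative update rules for the Frobenius-norm objective to approximate a small nonnegative matrix X by a product UV. It is not taken from this paper: the function names (nmf, frob_err), the random initialization, the iteration count, and the small constant eps are assumptions made for illustration, and plain Python lists are used only to keep the example free of dependencies.

```python
import random

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def frob_err(X, U, V):
    """Squared Frobenius norm of X - UV."""
    UV = matmul(U, V)
    return sum((x - y) ** 2 for rx, ry in zip(X, UV) for x, y in zip(rx, ry))

def scale(M, num, den, eps):
    """Elementwise M * num / (den + eps); nonnegative inputs stay nonnegative."""
    return [[m * a / (b + eps) for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(M, num, den)]

def nmf(X, k, iters=200, eps=1e-9, seed=0):
    """Factor an m x n nonnegative X into U (m x k) and V (k x n).

    Hypothetical helper: initialization, iters, and eps are illustrative choices.
    """
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        Vt = transpose(V)
        # U <- U * (X V^T) / (U V V^T)
        U = scale(U, matmul(X, Vt), matmul(U, matmul(V, Vt)), eps)
        Ut = transpose(U)
        # V <- V * (U^T X) / (U^T U V)
        V = scale(V, matmul(Ut, X), matmul(matmul(Ut, U), V), eps)
    return U, V

if __name__ == "__main__":
    X = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0], [1.0, 1.0, 1.0]]
    U, V = nmf(X, k=2)
    print("reconstruction error:", frob_err(X, U, V))
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, both factors stay nonnegative without any explicit projection step.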
However, conventional NMF suffers from a few deficiencies: (1) conventional NMF usually works in batch mode
and requires all data to reside in memory, which leads
to tremendous storage overhead as the number of data
samples increases, and (2) conventional NMF ignores the geometrical
structure embedded in the data, which results in unsatisfactory representation ability. To overcome the first deficiency, both
parallel and distributed algorithms have been proposed to
adapt NMF to large-scale datasets. Kanjani [8] utilized
multithreading to develop a parallel NMF (PNMF) on a
multicore machine. Robila and Maciak [9] introduced two
thread-level parallel versions of the traditional multiplicative
solution and the adaptive projected gradient method. However,
their methods are impractical for large-scale datasets due to
the memory limitation of a single computer. Liu et al. [10]
proposed a distributed NMF (DNMF) to analyze large-scale
web dyadic data and verified its effectiveness on distributed
computing systems. Dong et al. [11] also attempted to design
a PNMF for distributed memory platforms using
the message passing interface library. Although all the above
parallel NMF algorithms achieve considerable speedup
and scalability, they do not consider the geometric
structure of the dataset. To overcome the second deficiency, Cai
et al. [12] proposed graph-regularized NMF (GNMF), which
extends conventional NMF by constructing an affinity
graph [13] to encode the geometrical information of the data
and thereby enhances representation ability. Gu et al. [14] further
extended GNMF to avoid the trivial solution and scale transfer
problems by imposing a normalized cut-like constraint on
the cluster assignment matrix. Lu et al. [15] incorporated
manifold regularization into NMF for hyperspectral unmixing and obtained desirable unmixing performance. These
improved algorithms achieve better representation ability but
are inefficient on large-scale datasets. Liu et al. [16]
introduced the geometric structure of data into incremental
NMF [17, 18] and utilized two efficient sparse approximations,
buffering and random projected tree, to process large-scale
datasets. Yu et al. [19] also presented an incremental GNMF
algorithm to improve scalability. However, these algorithms
perform well only on incremental or streaming datasets and
cannot deal with large-scale batch datasets. In addition,
Guan et al. [20] and Liu et al. [21] introduced the
Manhattan distance and a large-cone penalty, respectively, into NMF to
improve representation and generalization ability.
In conclusion, none of the above works can simultaneously overcome both deficiencies, because of the heavy computation needed for the decomposition and the storage
required by the adjacency matrix. In this paper, we
combine the advantages of parallel NMF and manifold
regularized NMF and design a parallel NMF with manifold
regularization method (PNMF-M) by parallelizing manifold
regularized NMF on distributed computing systems. In particular, PNMF-M distributes both data samples and factor
matrices to multiple computing nodes in a balanced way and
parallelizes the update of both factor matrices locally on each
node. Since graph construction is the bottleneck of the
computation of PNMF-M, we adopt a two-step distributed
graph construction method to compute the adjacency matrix
and obtain an adjacency graph equivalent to the one constructed
in batch mode. Experimental results on popular text corpora
and image datasets show that PNMF-M not only outperforms
conventional NMF in terms of both scalability and time efficiency through parallelization, but also significantly enhances the
representation ability of conventional NMF by incorporating
manifold regularization.
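To make the role of the adjacency matrix concrete, the following single-machine sketch adds a graph term to the multiplicative updates, following the commonly cited GNMF update rules attributed to Cai et al. [12], rewritten here for X ≈ UV with V of size k × n. It is only an assumption-laden illustration and not the PNMF-M algorithm: the graph is built in batch mode rather than by the two-step distributed construction, and the 0/1 k-nearest-neighbor weighting, the names knn_graph, gnmf, and objective, and the parameters lam, k_nn, iters, and eps are all hypothetical choices.

```python
import random

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def scale(M, num, den, eps):
    """Elementwise M * num / (den + eps)."""
    return [[m * a / (b + eps) for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(M, num, den)]

def knn_graph(X, k_nn):
    """Batch-mode symmetric 0/1 adjacency matrix over the columns (samples) of X."""
    samples = transpose(X)
    n = len(samples)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dist = sorted((sum((a - b) ** 2 for a, b in zip(samples[i], samples[j])), j)
                      for j in range(n) if j != i)
        for _, j in dist[:k_nn]:
            W[i][j] = W[j][i] = 1.0  # symmetrize: i~j if either is a neighbor of the other
    return W

def objective(X, U, V, W, lam):
    """||X - UV||_F^2 + lam * Tr(V L V^T), with L = D - W."""
    UV = matmul(U, V)
    fit = sum((x - y) ** 2 for rx, ry in zip(X, UV) for x, y in zip(rx, ry))
    reps = transpose(V)  # one low-dimensional representation per sample
    n = len(W)
    smooth = 0.5 * sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(reps[i], reps[j]))
                       for i in range(n) for j in range(n))
    return fit + lam * smooth

def gnmf(X, W, k, lam=1.0, iters=200, eps=1e-9, seed=0):
    """Graph-regularized multiplicative updates (illustrative, single machine)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    deg = [sum(row) for row in W]  # diagonal of the degree matrix D
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        Vt = transpose(V)
        # U <- U * (X V^T) / (U V V^T)
        U = scale(U, matmul(X, Vt), matmul(U, matmul(V, Vt)), eps)
        Ut = transpose(U)
        # V <- V * (U^T X + lam V W) / (U^T U V + lam V D)
        num = [[a + lam * b for a, b in zip(ra, rb)]
               for ra, rb in zip(matmul(Ut, X), matmul(V, W))]
        den = [[a + lam * v * d for a, v, d in zip(ra, rv, deg)]
               for ra, rv in zip(matmul(matmul(Ut, U), V), V)]
        V = scale(V, num, den, eps)
    return U, V

if __name__ == "__main__":
    X = [[1.0, 1.1, 0.9, 0.0, 0.1, 0.0], [1.0, 0.9, 1.1, 0.1, 0.0, 0.0],
         [0.0, 0.1, 0.0, 1.0, 1.1, 0.9], [0.1, 0.0, 0.0, 1.0, 0.9, 1.1]]
    W = knn_graph(X, k_nn=2)
    U, V = gnmf(X, W, k=2, lam=0.5)
    print("objective:", objective(X, U, V, W, 0.5))
```

In this sketch the term V W couples every column of V to its graph neighbors, and the full n × n matrix W lives on one machine, which illustrates why both the computation and the storage of the adjacency matrix become a concern at scale.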
$V \in \mathbb{R}^{k \times n}$. Given that there are $p$ processing units, both $k$
and $n$ are exactly divisible by $p$. They divide the above three
matrices into $p$ equally sized submatrices by column; that is,
$X = [X_1, \ldots, X_p]$, $U = [U_1, \ldots, U_p]$, and $V = [V_1, \ldots, V_p]$.
They still minimize t (...truncated)