Constrained Projection Approximation Algorithms for Principal Component Analysis
0
Advanced Brain Signal Processing Lab,
Brain Science Institute
, RIKEN, 2-1 Hirosawa, Wako-shi,
Saitama 351-0198, Japan
1
Department of Physics, Pohang University of Science and Technology
, San 31 Hyoja-dong, Nam-gu, Pohang 790-784,
Korea
2
Department of Computer Science, Pohang University of Science and Technology
, San 31 Hyoja-dong, Nam-gu, Pohang 790-784,
Korea
In this paper, we introduce a new error measure, integrated reconstruction error (IRE) and show that the minimization of IRE leads to principal eigenvectors (without rotational ambiguity) of the data covariance matrix. Then, we present iterative algorithms for the IRE minimization, where we use the projection approximation. The proposed algorithm is referred to as COnstrained Projection Approximation (COPA) algorithm and its limiting case is called COPAL. Numerical experiments demonstrate that these algorithms successfully find exact principal eigenvectors of the data covariance matrix.
1. Introduction
Principal component analysis (PCA) or principal subspace analysis (PSA) is a
fundamental multivariate data analysis method which is encountered into a variety of
areas in neural networks, signal processing, and machine learning [10]. A variety
of adaptive (on-line) algorithms for PCA or PSA can be found in literature [4, 5,
7, 11, 13, 15] (see also [8] and references therein). Most of these algorithms are
gradient-based learning algorithms, hence the convergence is slow.
The power iteration is a classical method for estimating the largest eigenvector
of a symmetric matrix. The subspace iteration is a direct extension of the power
iteration, computing subspace spanned by principal eigenvectors of a symmetric
matrix. The natural power method is an exemplary instance of the subspace
iteration, where the invariant subspace spanned by the n largest eigenvectors of the
data covariance matrix, is determined [9]. The natural power iteration provides
a general framework for several well-known subspace algorithms, including Ojas
subspace rule [11], PAST [16], and OPAST [1].
A common derivation of PSA, is terms of a linear (orthogonal) projection W =
[w1, . . . , wn] Rmn such that given a centered data matrix X = [x(1), . . . , x(N )]
RmN , the reconstruction error X W W X 2F is minimized, where F denotes
the Frobenius norm (Euclidean norm). It is known that the reconstruction error is
blind to an arbitrary rotation of the representation space. The minimization of the
reconstruction error leads to W = U 1Q, where Q Rnn is an arbitrary orthogonal
matrix and the eigendecomposition of the covariance matrix C = XX is given by
C =
where U 1 Rmn contains n largest eigenvectors, U 2 Rm(mn) consists of the rest
of eigenvectors, and associated eigenvalues are in 1, 2 with 1 > 2 > > m.
Probabilistic model-based method for PCA was developed, where the linear
generative model was considered and expectation maximization (EM) optimization
was used to derive iterative PCA algorithms, including probabilistic PCA (PPCA)
[14] and EM-PCA [12]. These algorithms are batch algorithms that find principal
subspace. Hence, further post-processing is required to determine exact principal
eigenvectors of the data covariance matrix, without rotational ambiguity. The
natural power iteration is also a PSA-type algorithm, unless the deflation method is
used.
In this paper, we present iterative algorithms which determine the principal
eigenvectors of the data covariance matrix in a parallel fashion (in contrast to
the deflation method). To this end, we first introduce the integrated
reconstruction error (IRE) and show that its minimization leads to exact principal
eigenvectors (without rotational ambiguity). Proposed iterative algorithms emerge from the
minimization of the IRE and are referred to as COPA algorithm and COPAL (the
limiting case of COPA). These algorithms are the recognition model counterpart
of the constrained EM algorithm in [2, 3] where principal directions are estimated
through alternating two steps (E and M steps) in the context of the linear
coupled generative model. In contrast, our algorithms COPA and COPAL, need not
go through two steps, which is a major advantage over EM type algorithms.
2. Integrated Reconstruction Error
It was shown in [16] that the reconstruction error JRE = X W W X 2F attains
the global minimum if and only if W = U 1Q. Now, we introduce the IRE that is
summarized below.
DEFINITION 1 (IRE). The integrated reconstruction error, JIRE, is defined as a
linear combination of n partial reconstruction errors (PRE), Ji = X W Ei W X 2F ,
i.e.,
JIRE(W ) =
i=1
[Ei ]jj =
Proof. See Appendix.
The last term in IRE, Jn, is the standard reconstruction error. It was shown in
[16] that W is a stationary point of Jn if and only if W = U 1Q (hence W W =
I is satisfied). All stationary points of Jn are saddle points, except when U 1
contains the n dominant eigenvectors of C. In that case, Jn attains the global
minimum.
The standard reconstruction error Jn is invariant to an orthogonal transform
Q because WQQ W = W W . In contrast, the IRE is not invariant under an
orthogonal transform, since QEiQ = Ei . This provides an intuitive idea why
the IRE minimization leads to the principal eigenvectors of the data covariance
matrix, without rotational ambiguity.
PREs Ji are of the form
Ji =
X
w1w1 + + wi wi
where wi represents the ith column vector of W . The PRE i + 1, Ji+1,
represents the reconstruction error for (i + 1)-dimensional principal subspace which
completely includes i-dimensional principal subspace. Therefore, the
minimization of JIRE implies that each PRE Ji for i = 1, . . . , n, is minimized. The
graphical representation is shown in Figure 1, where a coupled linear recognition
model is described, with a link of the IRE minimization.
Minimizing each Ji is reminiscent of the deflation method where the
eigenvectors of C are extracted one by one. Thus, it is expected that the
minimization of IRE leads to principal eigenvectors of C. However, a major difference
between the deflation method and our method is that the former extracts
principal components one by one and the latter find principal components
simultaneously.
Iterative Algorithms
The projection approximation [16] assumes that the difference between W (k+1)X
and W (k)X is small, which leads us to consider the following objective function
i=1
X W (k+1)Ei Y (k)
where Y (k) = W (k)X.
The gradient of (3) with respect to W(k+1) is given by
JIRE
W (k+1)
= XY (k)
is the Hadamard product (element-wise product).
With these definitions, it follows from
JIRE
W (k+1)
= 0 that
1
1
1
= XY
= XY
1
U(Yij ) =
if i j .
where U(Y ) is an element-wise operator, whose arguments Yij are transformed by
The operator U(Y ) results from the structure of
1 given by
1
.
.
.
W (k+1) = CW (k) U
W (k)CW (k)
1 .
UT (Yij ) =
This leads to the COPAL algorithm
W (k+1) = CW (k) UT
W (k)CW (k)
1 .
Replacing Y (k) by W (k)X, leads to the updat (...truncated)