Semi-supervised local Fisher discriminant analysis for dimensionality reduction
Masashi Sugiyama · Tsuyoshi Idé · Shinichi Nakajima · Jun Sese
Editor: Roni Khardon.
T. Idé: IBM Research, Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa 242-8502, Japan
M. Sugiyama: Department of Computer Science, Tokyo Institute of Technology, 2-12-2 O-okayama, Meguro-ku, Tokyo 152-8552, Japan
When only a small number of labeled samples are available, supervised dimensionality reduction methods tend to perform poorly because of overfitting. In such cases, unlabeled samples could be useful in improving the performance. In this paper, we propose a semi-supervised dimensionality reduction method which preserves the global structure of unlabeled samples in addition to separating labeled samples in different classes from each other. The proposed method, which we call SEmi-supervised Local Fisher discriminant analysis (SELF), has an analytic form of the globally optimal solution and it can be computed based on eigen-decomposition. We show the usefulness of SELF through experiments with benchmark and real-world document classification datasets.
1 Introduction
The goal of dimensionality reduction is to obtain a low-dimensional representation of
high-dimensional data samples while preserving most of the intrinsic information contained in
the original data (Roweis and Saul 2000; Tenenbaum et al. 2000; Hinton and Salakhutdinov
2006). If dimensionality reduction is carried out appropriately, the compact representation
of the data can be used for various tasks such as visualization and classification.
In supervised learning scenarios where data samples are accompanied by class labels,
Fisher discriminant analysis (FDA) (Fisher 1936; Fukunaga 1990) is a popular
dimensionality reduction method. FDA seeks an embedding transformation such that the
between-class scatter is maximized and the within-class scatter is minimized. FDA works very well
if the samples in each class follow Gaussian distributions with a shared covariance
structure. However, FDA tends to give undesired results if the samples in a class form several
separate clusters or there are outliers (Fukunaga 1990). To overcome this drawback, Local
FDA (LFDA) has been proposed (Sugiyama 2007). LFDA localizes the evaluation of the
within-class scatter, and thus works well even when within-class multimodality or outliers
exist. In addition, LFDA overcomes a critical limitation of the original FDA in
dimensionality reduction: the dimension of the FDA embedding space should be less than the number
of classes (Fukunaga 1990), while LFDA does not suffer from this restriction in general.
Moreover, LFDA was shown to compare favorably with other supervised dimensionality
reduction methods through experiments (Sugiyama 2007).
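The FDA criterion and its dimension limit can be illustrated with a minimal NumPy sketch. It assumes the standard textbook definitions of the between-class and within-class scatter matrices; the function name `fda_scatter` and the synthetic three-class data are hypothetical and serve only as an illustration. Because the between-class scatter is built from the class means, its rank is at most the number of classes minus one, which is why the FDA embedding dimension is bounded by the number of classes.

```python
import numpy as np

def fda_scatter(X, y):
    """Return (S_b, S_w) for samples X of shape (d, n) and labels y of shape (n,)."""
    d, _ = X.shape
    mu = X.mean(axis=1, keepdims=True)           # overall mean, shape (d, 1)
    S_b = np.zeros((d, d))                       # between-class scatter
    S_w = np.zeros((d, d))                       # within-class scatter
    for c in np.unique(y):
        Xc = X[:, y == c]                        # samples of class c
        mu_c = Xc.mean(axis=1, keepdims=True)    # class mean
        S_b += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
        S_w += (Xc - mu_c) @ (Xc - mu_c).T
    return S_b, S_w

# Hypothetical toy data: three Gaussian classes in five dimensions.
rng = np.random.default_rng(0)
d, n_per_class, n_classes = 5, 30, 3
means = rng.normal(scale=3.0, size=(d, n_classes))
X = np.hstack([means[:, [c]] + rng.normal(size=(d, n_per_class))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

S_b, S_w = fda_scatter(X, y)

# The rank of S_b is at most n_classes - 1, so FDA can extract
# at most n_classes - 1 meaningful directions.
rank_b = np.linalg.matrix_rank(S_b, tol=1e-8 * np.linalg.norm(S_b, 2))
print("rank of S_b:", rank_b)
```

In this sketch the two scatter matrices sum to the total scatter of the centered data, which gives a quick sanity check on the implementation.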
However, the performance of LFDA (and all other supervised dimensionality reduction
methods) tends to be degraded when only a small number of labeled samples are available.
Namely, the supervised dimensionality reduction methods tend to find embedding spaces
which are overfitted to the labeled samples. In such cases, it is effective to make use of
unlabeled samples, which are often abundantly available; such a setup is called semi-supervised
learning (Chapelle et al. 2006). Through extensive experiments, it was shown that
principal component analysis (PCA) (Jolliffe 1986), which is an unsupervised dimensionality
reduction method for preserving the global data structure, works moderately well in
semi-supervised learning scenarios (see, e.g., Chap. 21 of Chapelle et al. 2006).
Although PCA was reported to work well, it may not be the best possible choice in the
semi-supervised situation because of its unsupervised nature. In this paper, we propose an
alternative semi-supervised dimensionality reduction method. Our basic idea is to smoothly
bridge LFDA and PCA so that our reliance on the global structure of unlabeled samples and
information brought by (a small number of) labeled samples can be controlled. We show
experimentally that the proposed method, which we refer to as semi-supervised LFDA (SELF),
compares favorably with other methods. Note that SELF retains the computational
advantage of LFDA and PCA, i.e., a global solution can be computed analytically by
eigen-decomposition. SELF is therefore computationally as efficient as LFDA and
PCA.
The rest of this paper is organized as follows. In Sect. 2, the linear dimensionality
reduction problem addressed in this paper is formulated and some mathematical facts used in the
following sections are briefly summarized. In Sect. 3, existing supervised and unsupervised
dimensionality reduction methods are reviewed in a systematic and unified manner. This
unified view will be the foundation for developing our new method in the following section.
Those who are familiar with the existing methods and interested in immediately looking at
the new method may choose to skip the review materials provided in Sect. 3. In Sect. 4, we
propose the new semi-supervised dimensionality reduction method SELF and show its
properties. Section 5 is devoted to experiments showing the usefulness of the proposed approach.
Finally, in Sect. 6, we conclude with a discussion on possible future directions.
2 Preliminaries
In this section, we formulate the linear dimensionality reduction problem and give some mathematical background.
2.1 Formulation
Let $x_i \in \mathbb{R}^d$ ($i = 1, 2, \ldots, n$) be $d$-dimensional sample vectors and let $X \in \mathbb{R}^{d \times n}$ be the matrix of all samples:
$$X := (x_1 | x_2 | \cdots | x_n).$$
Let $z \in \mathbb{R}^r$ ($1 \le r \le d$) be a low-dimensional representation of a high-dimensional sample $x \in \mathbb{R}^d$, where $r$ is the dimensionality of the reduced space. For the moment, we focus on linear dimensionality reduction, i.e., using a transformation matrix $T \in \mathbb{R}^{d \times r}$, an embedded representation $z$ of the sample $x$ is obtained as
$$z = T^\top x,$$
where $^\top$ denotes the transpose of a matrix or a vector. Later, we extend our discussion to cases where the mapping from $x$ to $z$ is non-linear.
2.2 Generalized eigenvalue problem
Many dimensionality reduction techniques developed so far involve an optimization problem of the following form:
$$T^{(\mathrm{OPT})} := \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \operatorname{tr}\left( T^\top B T \, (T^\top C T)^{-1} \right) \right].$$
Roughly speaking, $B$ encodes the quantity that we want to increase (e.g., between-class separability), and $C$ corresponds to the quantity that we want to decrease (e.g., within-class scatter). In the next section, we show how $B$ and $C$ are designed in some specific cases. Note that the same solution $T^{(\mathrm{OPT})}$ can also be obtained as follows (see e.g., Fukunaga 1990):
$$T^{(\mathrm{OPT})} = \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \frac{\det(T^\top B T)}{\det(T^\top C T)} \right],$$
$$T^{(\mathrm{OPT})} = \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \operatorname{tr}(T^\top B T) \right] \quad \text{subject to } T^\top C T = I_r,$$
where $I_r$ is the identity matrix on $\mathbb{R}^r$ and $\det(\cdot)$ denotes the determinant of a matrix.
Let $\{\varphi_k\}_{k=1}^{d}$ be the generalized eigenvectors associated with the generalized eigenvalues $\{\lambda_k\}_{k=1}^{d}$ of the following generalized eigenvalue problem:
$$B \varphi = \lambda C \varphi.$$
We assume that the generalized eigenvalues are sorted in descending order as
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d,$$
and the generalized eigenvectors are normalized as
$$\varphi_k^\top C \varphi_k = 1$$
for $k = 1, 2, \ldots, d$.
Note that this normalization is often carried out automatically by an eigen-solver. (...truncated)
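As an illustration of the generalized eigenvalue problem and the normalization discussed above, the following NumPy sketch solves $B\varphi = \lambda C\varphi$ by Cholesky whitening. Everything in it is an assumption made for illustration: the helper name `generalized_eigh`, the random matrices standing in for $B$ and $C$ (with $C$ taken to be positive definite), and the choice of the leading $r$ generalized eigenvectors as the columns of $T$, which is the standard way such trace criteria are solved. It is not the specific construction of $B$ and $C$ used by any particular method.

```python
import numpy as np

def generalized_eigh(B, C):
    """Solve B phi = lam C phi for symmetric B and symmetric positive definite C.

    Returns the eigenvalues in descending order and the eigenvectors (as
    columns), normalized so that phi_k^T C phi_k = 1.
    """
    L = np.linalg.cholesky(C)          # C = L L^T
    L_inv = np.linalg.inv(L)
    A = L_inv @ B @ L_inv.T            # equivalent standard problem A u = lam u
    lam, U = np.linalg.eigh(A)         # ascending eigenvalues, orthonormal U
    order = np.argsort(lam)[::-1]      # re-sort in descending order
    lam, U = lam[order], U[:, order]
    Phi = L_inv.T @ U                  # map back: phi = L^{-T} u
    return lam, Phi

# Hypothetical stand-ins for B and C (not tied to any specific method).
rng = np.random.default_rng(0)
d, n, r = 6, 50, 2
X = rng.normal(size=(d, n))            # samples as columns, X in R^{d x n}
M = rng.normal(size=(d, d))
B = M @ M.T                            # symmetric positive semi-definite
C = X @ X.T / n + 1e-3 * np.eye(d)     # symmetric positive definite

lam, Phi = generalized_eigh(B, C)

# Assumed solution: the r leading generalized eigenvectors form T.
T = Phi[:, :r]
Z = T.T @ X                            # embedded samples z = T^T x, shape (r, n)

# With this normalization T^T C T = I_r, so the trace criterion
# reduces to the sum of the r largest generalized eigenvalues.
objective = np.trace(T.T @ B @ T @ np.linalg.inv(T.T @ C @ T))
print(objective, lam[:r].sum())
```

The normalization $\varphi_k^\top C \varphi_k = 1$ comes for free here because the whitened eigenvectors are orthonormal. Library solvers such as `scipy.linalg.eigh(B, C)` document the same normalization for this problem type, so an explicit rescaling step is usually unnecessary.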