Large-scale latent semantic analysis
Andrew McGregor Olney
A. M. Olney
Institute for Intelligent Systems, 365 Innovation Drive, Suite 303, Memphis, TN 38152, USA
Department of Psychology, University of Memphis, 365 Innovation Drive, Suite 303, Memphis, TN 38152, USA
Latent semantic analysis (LSA) is a statistical technique for representing word meaning that has been widely used for making semantic similarity judgments between words, sentences, and documents. To perform LSA, an LSA space is created in a two-stage procedure: the construction of a word frequency matrix, followed by the dimensionality reduction of that matrix through singular value decomposition (SVD). This article presents LANSE, an SVD algorithm specifically designed for LSA, which allows extremely large matrices to be processed using off-the-shelf computer hardware.
Latent semantic analysis (LSA) is a statistical technique for
representing world knowledge (Deerwester, Dumais,
Furnas, Landauer, & Harshman, 1990; Landauer, Foltz, &
Laham, 1998). Since its discovery, LSA has been heavily
used in both the psychological and computational
linguistics communities. In psychological research, LSA has been
used to approximate vocabulary acquisition in children,
grade essays, match students with optimal texts for
learning, predict text coherence, make humanlike text
similarity judgments, take subject matter multiple-choice
tests with human performance, mirror lexical priming, and
understand student input during tutorial dialogue, among
many other things (Foltz, Kintsch, & Landauer, 1998;
Graesser, VanLehn, Rose, Jordan, & Harter, 2001;
Landauer & Dumais, 1997; Landauer et al., 1998;
Landauer, McNamara, Dennis, & Kintsch, 2007; Rehder
et al., 1998; Wolfe et al., 1998). In computational
linguistics, LSA has been used for text segmentation,
speech recognition, entailment detection, summarization,
and information retrieval, again among many other things
(Bellegarda, 2000; Coccaro & Jurafsky, 1998; Deerwester
et al., 1990; Deng & Khudanpur, 2003; Dumais, 1991;
Foltz et al., 1998; Olney, 2007a; Olney & Cai, 2005a,
2005b). The duality of use across these communities
underlines the multiple viewpoints surrounding LSA. On
the one hand, LSA can be seen as a valuable tool for
imbuing computers with some notion of semantic
relatedness, and on the other, LSA can be seen as a computational
model of cognition with wide-ranging implications for
cognitive theory (Landauer et al., 2007).
The fact that LSA enjoys wide use in many communities
is a testament to the elegance of its model and the
simplicity of its use. Conceptually, LSA maps words onto
points in a space. Similar words tend to be nearby in this
space, while unrelated words are more distant. Since each
point in this space can be represented as a vector,
representations for documents can be created by summing
the vector representations of their constituent words. The
vector addition property has two important consequences.
First, any size collection of words can be compared with
any other size collection in the same way that two
individual words can be compared with each other. Second,
the representation of any collection of words has the same
dimensionality as a single word in the collection; both are a
vector of fixed size.
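The vector addition property can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the three-dimensional word vectors are invented toy values (real LSA vectors come from the SVD step and typically have a few hundred dimensions), the names `word_vectors`, `text_vector`, and `cosine` are hypothetical, and cosine is assumed here as the similarity measure.

```python
import numpy as np

# Toy word vectors, invented for illustration only.
word_vectors = {
    "doctor": np.array([0.9, 0.1, 0.0]),
    "nurse": np.array([0.8, 0.2, 0.1]),
    "hospital": np.array([0.7, 0.3, 0.0]),
    "guitar": np.array([0.0, 0.2, 0.9]),
    "music": np.array([0.1, 0.1, 0.8]),
}
DIMENSIONS = 3


def text_vector(text):
    """Represent a text as the sum of the vectors of its known words."""
    total = np.zeros(DIMENSIONS)
    for word in text.lower().split():
        if word in word_vectors:
            total += word_vectors[word]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# A single word and a two-word text live in the same space,
# so they can be compared directly.
single = text_vector("doctor")
medical = text_vector("nurse hospital")
musical = text_vector("guitar music")
sim_related = cosine(single, medical)
sim_unrelated = cosine(single, musical)
```

With these toy values, `sim_related` is larger than `sim_unrelated`, and every text vector has the same fixed length regardless of how many words were summed.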
Using this conceptual description as a background, we
now describe the process of LSA space creation in more
detail. At a high level, creating LSA spaces involves two
steps: construction of a term-document matrix and the
singular value decomposition of that matrix. A term-document
matrix is created by counting term (or word)
frequencies across a collection of documents. In the matrix,
the value at row i column j is the number of times term i
appeared in document j. Weighting schemes can further be
applied to this matrix to improve task performance
(Dumais, 1991). Several observations can be made about
the term-document matrix for natural languages such as
English. First, the matrix will necessarily be quite sparse,
since not all words occur in all documents. Thus, for any
given column of the matrix corresponding to a document,
many of its entries will be zero. Moreover, the matrix is
likely to be rectangular in shape, since there is no constraint
that the number of words should equal the number of
documents.
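The construction of a term-document matrix can be sketched as follows. The three-document corpus and the whitespace tokenization are assumptions made for illustration, and a dense array is used only to keep the example short; a matrix of realistic size would be held in a sparse format.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus; each string is one document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]
tokenized = [doc.split() for doc in docs]

# One row per distinct term, one column per document.
vocab = sorted({term for doc in tokenized for term in doc})
row = {term: i for i, term in enumerate(vocab)}

A = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(tokenized):
    for term, count in Counter(doc).items():
        A[row[term], j] = count  # times term i appears in document j

# Fraction of entries that are zero.
sparsity = 1.0 - np.count_nonzero(A) / A.size
```

Even in this tiny corpus the matrix is rectangular (10 terms by 3 documents) and more than half of its entries are zero; both properties become far more pronounced as the collection grows.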
The second step of LSA is singular value decomposition
(SVD). SVD is a fundamental technique in linear algebra.
SVD is also an unsupervised method of dimensionality
reduction that is optimal in the least squares sense. To see
why, consider the definition of SVD:
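$$A = U \Sigma V^T, \tag{1}$$
where $U$ and $V$ are orthonormal matrices and
$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ with
$\sigma_1 \geq \cdots \geq \sigma_n \geq 0$. The $\sigma_i$ are the singular
values of the matrix $A$.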
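The properties stated in this definition can be checked numerically. The sketch below uses NumPy's dense `numpy.linalg.svd` on a small random matrix, which is an assumption made for illustration and is unrelated to the Lanczos-based methods discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # small random matrix, for illustration only

# Thin SVD: U is 6x4, s holds the singular values, Vt is V transposed (4x4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A is recovered from the three factors.
reconstructs = np.allclose(A, U @ np.diag(s) @ Vt)

# The columns of U and of V are orthonormal.
u_orthonormal = np.allclose(U.T @ U, np.eye(4))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(4))

# The singular values are non-negative and sorted in descending order.
ordered = bool(np.all(s[:-1] >= s[1:]) and s[-1] >= 0)
```

All four checks evaluate to `True` for any real matrix, up to floating-point tolerance.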
A theorem by Eckart and Young (1936) establishes the
dimensionality reduction property of SVD. The theorem
states that a rank $k$ approximation of the original rank $n$
matrix may be created by setting the singular values
$\sigma_{k+1}, \ldots, \sigma_n$ to zero. Moreover, the theorem states that this reduced
rank matrix $A_k$ has minimal distance to $A$ in terms of the
Frobenius norm,
$$\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}.$$
Thus, the theorem states that
$$\min_{\mathrm{rank}(B) \leq k} \|A - B\|_F = \|A - A_k\|_F.$$
In other words, by choosing a smaller number of
dimensions, the resulting matrix $A_k$ is an optimal
approximation of the original matrix $A$ in the least squares sense. For
this reason, SVD can be a useful tool for dimensionality
reduction and noise elimination; since the dimensions
retained account for most of the variance in the matrix, the
eliminated dimensions can be considered noise.
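The rank reduction described by the Eckart and Young theorem can be illustrated numerically. In the sketch below, the helper name `truncate`, the random test matrix, and the use of NumPy's dense SVD are all assumptions made for illustration. It relies on the standard fact that the Frobenius error of the rank $k$ truncation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np


def truncate(A, k):
    """Return the rank-k matrix obtained by zeroing all but the k largest
    singular values of A, together with the full list of singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s


rng = np.random.default_rng(1)
A = rng.random((8, 5))
k = 2

A_k, s = truncate(A, k)
error = np.linalg.norm(A - A_k, "fro")

# Error predicted from the discarded singular values.
predicted = np.sqrt(np.sum(s[k:] ** 2))

# Any other matrix of rank at most k should be no closer to A.
B = rng.random((8, k)) @ rng.random((k, 5))
other_error = np.linalg.norm(A - B, "fro")
```

Here `error` matches `predicted`, and `other_error` is at least as large as `error`, as the theorem requires for any competitor of rank at most `k`.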
Equation 1 provides the definition of SVD but says
nothing about how to calculate it. Indeed, calculation of the
SVD is the most complex and challenging stage of creating
an LSA space. Although a great deal of research has
established multiple methods for calculating SVD (Bai,
Demmel, Dongarra, Ruhe, & van der Vorst, 2000), LSA
research to date has focused on a single method: the
Lanczos algorithm with selective reorthogonalization
(LANSO; Martin & Berry, 2007). For reasons discussed
in detail below, traditional SVD algorithms such as
LANSO, despite their speed, require large amounts of
random access memory proportional to the size of the space
being created. The size limitation has restricted the kinds of
LSA spaces that have been made to date. For example,
bigram spaces potentially contain $N^2$ rows, where $N$ is the
number of word types in the original corpus. Such large
spaces require either a computer with a very large quantity
of random access memory or an alternative algorithm
without such a size limitation. In the remainder of this
article, we outline an alternative algorithm, called the
Lanczos algorithm for semantic spaces (LANSE). Our
algorithm is specifically designed for large-scale LSA
spaces and has previously been used in spaces with millions
of bigram terms (Olney, 2007b, 2009), as well as in
traditional spaces from large collections like Wikipedia
(Willits, D'Mello, Duran, & Olney, 2007).
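As a rough illustration of the $N^2$ growth mentioned above, the following sketch estimates the memory needed merely to hold one $k$-dimensional vector per row as a dense array. The vocabulary size, the dimensionality, the 8-byte values, and the function name are assumptions chosen for illustration, not figures from this article, and the estimate ignores whatever working memory a particular SVD algorithm needs on top of this.

```python
def dense_vectors_bytes(n_rows, k, bytes_per_value=8):
    """Bytes needed to hold n_rows vectors of k values each, stored densely."""
    return n_rows * k * bytes_per_value


N = 100_000  # hypothetical number of word types
k = 300      # hypothetical number of retained dimensions

# One row per word type.
unigram_bytes = dense_vectors_bytes(N, k)

# Upper bound for a bigram space: one row per possible word pair.
bigram_bytes = dense_vectors_bytes(N ** 2, k)

unigram_gb = unigram_bytes / 1e9
bigram_tb = bigram_bytes / 1e12
```

Under these assumptions the word-level vectors need about 0.24 GB, while the $N^2$ upper bound for bigrams needs about 24 TB. Because only a fraction of possible bigrams actually occurs in a corpus, real bigram spaces fall well below this bound, but the quadratic growth shows why memory-bound algorithms restrict the spaces that can be built.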
The Lanczos algorithm
In this section, we outline the Lanczos algorithm, which is
the b (...truncated)