Large-scale latent semantic analysis (pdf)

PDF: https://link.springer.com/content/pdf/10.3758%2Fs13428-010-0050-z.pdf

Andrew McGregor Olney

A. M. Olney, Institute for Intelligent Systems, 365 Innovation Drive, Suite 303, Memphis, TN 38152, USA; Department of Psychology, University of Memphis, 365 Innovation Drive, Suite 303, Memphis, TN 38152, USA

Latent semantic analysis (LSA) is a statistical technique for representing word meaning that has been widely used for making semantic similarity judgments between words, sentences, and documents. In order to perform an LSA analysis, an LSA space is created in a two-stage procedure, involving the construction of a word frequency matrix and the dimensionality reduction of that matrix through singular value decomposition (SVD). This article presents LANSE, an SVD algorithm specifically designed for LSA, which allows extremely large matrices to be processed using off-the-shelf computer hardware.

Latent semantic analysis (LSA) is a statistical technique for representing world knowledge (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer, Foltz, & Laham, 1998). Since its discovery, LSA has been heavily used in both the psychological and computational linguistics communities. In psychological research, LSA has been used to approximate vocabulary acquisition in children, grade essays, match students with optimal texts for learning, predict text coherence, make humanlike text similarity judgments, take subject matter multiple-choice tests with human performance, mirror lexical priming, and understand student input during tutorial dialogue, among many other things (Foltz, Kintsch, & Landauer, 1998; Graesser, VanLehn, Rose, Jordan, & Harter, 2001; Landauer & Dumais, 1997; Landauer et al., 1998; Landauer, McNamara, Dennis, & Kintsch, 2007; Rehder et al., 1998; Wolfe et al., 1998).
In computational linguistics, LSA has been used for text segmentation, speech recognition, entailment detection, summarization, and information retrieval, again among many other things (Bellegarda, 2000; Coccaro & Jurafsky, 1998; Deerwester et al., 1990; Deng & Khudanpur, 2003; Dumais, 1991; Foltz et al., 1998; Olney, 2007a; Olney & Cai, 2005a, 2005b). The duality of use across these communities underlines the multiple viewpoints surrounding LSA. On the one hand, LSA can be seen as a valuable tool for imbuing computers with some notion of semantic relatedness, and on the other, LSA can be seen as a computational model of cognition with wide-ranging implications for cognitive theory (Landauer et al., 2007).

The fact that LSA enjoys wide use in many communities is a testament to the elegance of its model and the simplicity of its use. Conceptually, LSA maps words onto points in a space. Similar words tend to be nearby in this space, while unrelated words are more distant. Since each point in this space can be represented as a vector, representations for documents can be created by summing the vector representations of their constituent words. The vector addition property has two important consequences. First, a collection of words of any size can be compared with a collection of any other size in the same way that two individual words can be compared with each other. Second, the representation of any collection of words has the same dimensionality as a single word in the collection; both are vectors of the same fixed size.

Using this conceptual description as a background, we now describe the process of LSA space creation in more detail. At a high level, creating LSA spaces involves two steps: construction of a term-document matrix and the singular value decomposition of that matrix. A term-document matrix is created by counting term (or word) frequencies across a collection of documents. In the matrix, the value at row i, column j is the number of times term i appeared in document j.
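The counting step just described can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function name `build_term_document_matrix`, the whitespace tokenization, and the three toy documents are all assumptions made for the example.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term i in document j."""
    # Assumption for this sketch: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # One row per distinct term, in sorted order so row indices are stable.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [
        [counts[j][term] for j in range(len(documents))]
        for term in vocabulary
    ]
    return vocabulary, matrix


# Toy corpus, invented for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]
vocabulary, matrix = build_term_document_matrix(documents)
row = dict(zip(vocabulary, matrix))

print(row["the"])  # counts of "the" in documents 0, 1, 2
print(row["cat"])  # counts of "cat" in documents 0, 1, 2
```

Even in this toy corpus most entries are zero, and the matrix has more rows (terms) than columns (documents). A real corpus would call for a sparse storage format rather than nested lists.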
Weighting schemes can further be applied to this matrix to improve task performance (Dumais, 1991). Several observations can be made about the term-document matrix for natural languages such as English. First, the matrix will necessarily be quite sparse, since not all words occur in all documents. Thus, for any given column of the matrix corresponding to a document, many of its entries will be zero. Moreover, the matrix is likely to be rectangular in shape, since there is no constraint that the number of words should equal the number of documents.

The second step of LSA is singular value decomposition (SVD). SVD is a fundamental technique in linear algebra. SVD is also an unsupervised method of dimensionality reduction that is optimal in the least squares sense. To see why, consider the definition of SVD:

$$A = U \Sigma V^T \qquad (1)$$

where $U$ and $V$ are orthonormal matrices, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, and $\sigma_1 \ge \ldots \ge \sigma_n \ge 0$. The $\sigma_i$ are the singular values of the matrix $A$. A theorem by Eckart and Young (1936) establishes the dimensionality reduction property of SVD. The theorem states that a rank $k$ approximation of the original rank $n$ matrix may be created by setting the singular values $\sigma_q$, $k + 1 \le q \le n$, to zero. Moreover, the theorem states that this reduced rank matrix $A_k$ has minimal distance to $A$ in terms of the Frobenius norm:

$$\|A - A_k\|_F = \min_{\mathrm{rank}(B) \le k} \|A - B\|_F$$

In other words, by choosing a smaller number of dimensions, the resulting matrix $A_k$ is an optimal approximation of the original matrix $A$ in the least squares sense. For this reason, SVD can be a useful tool for dimensionality reduction and noise elimination; since the dimensions retained account for most of the variance in the matrix, the eliminated dimensions can be considered noise.

Equation 1 provides the definition of SVD but says nothing about how to calculate it. Indeed, calculation of the SVD is the most complex and challenging stage of creating an LSA space.
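The rank-reduction property can be checked numerically on a small matrix. The sketch below uses NumPy's dense `numpy.linalg.svd`, which is a stand-in for illustration only and is not the Lanczos-based method this article is about. The matrix values are invented, and treating rows of $U_k \Sigma_k$ as term vectors and using cosine for comparison are assumed conventions rather than details stated in the text above.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (values invented for illustration).
A = np.array([
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Keeping only the k largest singular values is the same as zeroing the rest.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius distance between A and its rank-k approximation ...
error = np.linalg.norm(A - A_k, ord="fro")
# ... equals the root sum of squares of the discarded singular values.
expected_error = np.sqrt(np.sum(s[k:] ** 2))
print(error, expected_error)

# Assumed convention: row i of U_k * s_k is the k-dimensional vector for term i.
term_vectors = U[:, :k] * s[:k]


def text_vector(term_indices):
    """Represent a collection of terms by summing their term vectors."""
    return term_vectors[term_indices].sum(axis=0)


def cosine(x, y):
    """Cosine similarity (an assumed choice of comparison measure)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# A two-term collection and a single term live in the same k-dimensional space.
similarity = cosine(text_vector([0, 1]), text_vector([3]))
print(similarity)
```

The summed vector for two terms has the same length `k` as the vector for a single term, which is what allows collections of different sizes to be compared directly.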
Although a great deal of research has established multiple methods for calculating SVD (Bai, Demmel, Dongarra, Ruhe, & van der Vorst, 2000), LSA research to date has focused on a single method: the Lanczos algorithm with selective reorthogonalization (LANSO; Martin & Berry, 2007). For reasons discussed in detail below, traditional SVD algorithms such as LANSO, despite their speed, require large amounts of random access memory proportional to the size of the space being created. The size limitation has restricted the kinds of LSA spaces that have been made to date. For example, bigram spaces potentially contain $N^2$ rows, where $N$ is the number of word types in the original corpus. Such large spaces require either a computer with a very large quantity of random access memory or an alternative algorithm without such a size limitation. In the remainder of this article, we outline an alternative algorithm, called the Lanczos algorithm for semantic spaces (LANSE). Our algorithm is specifically designed for large-scale LSA spaces and has previously been used in spaces with millions of bigram terms (Olney, 2007b, 2009), as well as in traditional spaces from large collections like Wikipedia (Willits, D'Mello, Duran, & Olney, 2007).

The Lanczos algorithm

In this section, we outline the Lanczos algorithm, which is the b (...truncated)