An index-based algorithm for fast on-line query processing of latent semantic analysis

PLOS ONE, Dec 2019

Latent Semantic Analysis (LSA) is widely used for finding documents that are semantically similar to a keyword query. Although LSA yields promising results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line query processing. This is expensive in time and prevents efficient responses to query requests, especially when the dataset becomes large. In this paper, we study the efficiency of on-line query processing for LSA, with the goal of efficiently searching for documents similar to a given query. We rewrite the similarity equation of LSA in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index that skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA for fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between the query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes that correspond to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experimental comparisons with LSA demonstrate the efficiency and effectiveness of the proposed algorithm.
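The index-and-accumulate idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ILSA implementation: the abstract does not give the rewritten similarity equation, so the sketch assumes a simple linear form in which a document's score is the sum, over the non-zero query entries, of the query weight times a precomputed partial similarity. The function names `build_partial_index` and `accumulate_scores`, the toy values in `partial_sims`, and the threshold value are all hypothetical.

```python
from collections import defaultdict

def build_partial_index(partial_sims, theta):
    """Build {entry: [(doc, partial_similarity), ...]}, skipping values below theta.

    partial_sims is an assumed input: {entry_id: {doc_id: partial_similarity}}.
    """
    index = {}
    for entry, per_doc in partial_sims.items():
        nodes = [(doc, s) for doc, s in per_doc.items() if s >= theta]
        if nodes:
            index[entry] = nodes
    return index

def accumulate_scores(index, query_vec, top_k):
    """Score documents by visiting only index nodes for non-zero query entries."""
    scores = defaultdict(float)
    for entry, weight in query_vec.items():
        if weight == 0:
            continue  # zero entries contribute nothing, so their nodes are never read
        for doc, s in index.get(entry, ()):
            scores[doc] += weight * s
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Toy partial similarities (made up for illustration only).
partial_sims = {
    "index":  {"d1": 0.5,  "d2": 0.25, "d3": 0.01},
    "query":  {"d1": 0.25, "d3": 0.5},
    "matrix": {"d2": 0.75, "d3": 0.02},
}
partial_index = build_partial_index(partial_sims, theta=0.05)
top = accumulate_scores(partial_index, {"index": 1.0, "query": 1.0}, top_k=2)
print(top)
```

In this sketch, documents that never appear in a visited index node are never scored, which mirrors the candidate pruning described in the abstract, and entries below θ are dropped at build time, which is why the resulting scores are approximate.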



Mingxi Zhang (0, 1), Pohan Li (1), Wei Wang (1)

0 College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
1 School of Computer Science, Fudan University, Shanghai, China

Editor: Quan Zou, Tianjin University, China

Data Availability Statement: All relevant data are within the paper, and the DBLP data are available from the DBLP website (http://dblp.uni-trier.de/).

Funding: This work was supported by Natural Science Foundation of Shanghai grant 16ZR14228, http://www.stcsm.gov.cn/; Innovation Program of Shanghai Municipal Education Commission grants 15ZZ073 and 15ZZ074, http://www.shmec.gov.cn/; and Training Project of University of Shanghai for Science and Technology grant 16HJPY-QN04, http://www.usst.edu.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Many real data sets can be grouped as documents, including web pages, literature and product profiles. As such data sets become massive and diverse, there is a need to design algorithmic tools and develop applications that discover the underlying relationships in the data. Consider document search in a dataset: even if a document is on precisely the same topic as an input keyword query, it may not be found when the terms it contains differ from the input keywords. Previous work offers several semantic approaches for finding documents that are semantically similar to a keyword query, e.g., Latent Semantic Analysis (LSA) [1–4], Probabilistic LSA (PLSA) [5, 6], Latent Dirichlet Allocation (LDA) [7–10] and the latent factorization model (LFM) [11, 12]. Among these approaches, LSA is a well-known representative that has been widely applied in various research fields, including document retrieval [13, 14], query expansion [15, 16], data extraction [17, 18] and text classification [19, 20].
To improve the performance of these applications, LSA provides an effective function for searching for documents similar to a given keyword query. Specifically, LSA represents the relationship between documents and terms by a term-document matrix, which is further decomposed into a product of three other matrices by the singular value decomposition (SVD) [1, 3, 4]. SVD is the mathematical tool behind LSA and several other applications, including association prediction [21], similarity computation [22, 23], clustering [24, 25], image analysis [26] and collaborative filtering [27, 28]. For a given query, LSA transforms it into a pseudo document vector and computes the similarities between the query and candidate documents over the SVD result of the term-document matrix. LSA has also been applied to other research fields recently, including social data analysis [29, 30], collaborative filtering [31–33], sign language translation [34] and gene sequence analysis [35–40]. For example, in the field of social data analysis, [29] adopted LSA to produce better-annotated video clips in social multimedia data. [30] measured the sem (...truncated)
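The baseline LSA query pipeline described above (term-document matrix, SVD, pseudo document vector, similarity) can be sketched in Python with NumPy. The sketch assumes the standard textbook formulation: a rank-k truncated SVD, a query folded in as q^T U_k Σ_k^{-1}, and cosine similarity against the rows of V_k. The text above does not specify these details, so they are assumptions, as are the function names and the toy five-term, three-document matrix.

```python
import numpy as np

def lsa_fit(A, k):
    """Truncated SVD of a terms-by-documents matrix A.

    Returns U_k (terms x k), s_k (k singular values) and one latent row per document.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q, U_k, s_k):
    """Map a term-space query to a pseudo document vector: q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def cosine_scores(q_hat, docs):
    """Cosine similarity between the pseudo document vector and every document row."""
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return (docs @ q_hat) / np.where(norms == 0, 1.0, norms)

# Rows: car, automobile, engine, flower, petal. Columns: d0, d1, d2.
A = np.array([
    [1.0, 0.0, 0.0],   # car        appears only in d0
    [0.0, 1.0, 0.0],   # automobile appears only in d1
    [1.0, 1.0, 0.0],   # engine     shared by d0 and d1
    [0.0, 0.0, 1.0],   # flower     appears only in d2
    [0.0, 0.0, 1.0],   # petal      appears only in d2
])
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # keyword query: "car"

U_k, s_k, docs = lsa_fit(A, k=2)
scores = cosine_scores(fold_in(q, U_k, s_k), docs)
print(scores)
```

In this toy example, d1 never contains the term "car" yet scores about as high as d0, because both documents share "engine" and collapse onto the same latent dimension; d2 scores near zero. Note that the baseline computes a score for every document, which is the cost that an index-based approach aims to avoid.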


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0177523&type=printable

Mingxi Zhang, Pohan Li, Wei Wang. An index-based algorithm for fast on-line query processing of latent semantic analysis, PLOS ONE, 2017, Volume 12, Issue 5, DOI: 10.1371/journal.pone.0177523