TOPIC MODEL IMPLEMENTATION TO FIND RELATED DOCUMENTS IN CORPORATE ARCHIVES IN REAL LIFE: “A CASE SCENARIO ON KNOWLEDGE RETRIEVAL” (pdf)

INTERNATIONAL JOURNAL OF eBUSINESS AND eGOVERNMENT STUDIES
Vol 5, No 1, 2013 ISSN: 2146-0744 (Online)

TOPIC MODEL IMPLEMENTATION TO FIND RELATED DOCUMENTS IN CORPORATE ARCHIVES IN REAL LIFE: “A CASE SCENARIO ON KNOWLEDGE RETRIEVAL”

İhsan Tolga Medeni
Çankaya University, METU
Specialist, PhD Student

Tunç Durmuş Medeni
Yıldırım Beyazıt University, Turksat, METU, Çankaya University
Instructor, Senior Specialist

Abstract
Today’s organizations are mostly built on their documents, which are crucial sources of knowledge. Even when organizations know that these documents exist, it is often nearly impossible to extract the knowledge held captive inside them. Under these conditions, organizations choose to prepare the same document again rather than find the proper documents in their archives. Finding these documents, however, would save precious time and reduce redundant work. The topic model idea focuses on extracting knowledge from these types of documents. In this study, our aim is to summarize topic model research and to explain the latest model concept through an imaginary case scenario.

Key Words: Topic Model, Knowledge Extraction, Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA)

JEL Classification: C60, D83

1. INTRODUCTION

When an unexpected condition occurs in an organization, a solution has to be found within the organization's workflow, which creates a high volume of documents. In reality, the organization has probably faced a similar problem before, and a set of information and documents for resolving the condition already exists.
As pointed out in Davenport and Prusak’s Working Knowledge, if the organization had a proper knowledge structure independent of individual workers’ experience, which can easily be lost when those workers’ careers in the organization end, it would not need to do redundant jobs again and again (Davenport, 2001). One solution to this problem is the topic model concept. Topic modeling was defined to meet the need for extracting information without any user queries (Deerwester, 1999). Topic models allow documents to be presented as collections of topics rather than collections of words (Zheng, 2009). Work on this concept started in the 1990s with the vector space model and continued with latent semantic analysis/indexing (LSA, LSI), probabilistic LSA (pLSA) and Latent Dirichlet Allocation (LDA). In research areas ranging from medical science to software engineering, and from text mining to image processing, topic models have been used under different names such as Information Retrieval (IR), dimensionality reduction and word matching.

In this paper, a general summary focusing on these methods is given first. A possible application scenario from an organizational perspective follows. The paper ends with the conclusion.

2. TOPIC MODEL EVOLUTION

2.1. Vector Space model to LSI/LSA

In 1990, Deerwester and his colleagues proposed an approach for automatic indexing and retrieval (Deerwester, 1999). According to them, the existing techniques were based on user queries and simply matched the words of those queries. This approach, however, does not include any evidence about the meaning and concept of the document and topic. Taking a statistical approach instead, latent semantic indexing (LSI), also called latent semantic analysis (LSA), builds a semantic space that shows the associations between terms and documents.
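The limitation of plain word matching that motivates a semantic space can be illustrated with a minimal sketch. It assumes simple whitespace tokenization and raw term counts; the document names and texts are invented for illustration and do not come from any real archive. It is not LSA itself, only the word-matching baseline that LSA tries to improve on.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a bag-of-words term-count vector for one document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for terms that are missing from b.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical archive documents, invented for illustration only.
docs = {
    "report_a": "car engine repair cost",
    "report_b": "auto engine repair invoice",
    "report_c": "auto motor maintenance invoice",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

# report_a and report_b share two of their four words.
sim_ab = cosine_similarity(vectors["report_a"], vectors["report_b"])
# report_a and report_c share no words at all.
sim_ac = cosine_similarity(vectors["report_a"], vectors["report_c"])

print(f"report_a vs report_b: {sim_ab:.2f}")
print(f"report_a vs report_c: {sim_ac:.2f}")
```

In this toy data, report_a and report_c arguably describe the same subject, yet word matching scores them as unrelated because they have no term in common. A semantic space built from term and document associations is meant to close exactly this gap.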
The LSA concept was built on the vector space model. In that model, text documents are represented as vectors of terms, and the relationships between documents and terms are represented in a matrix. The cosine of the angle between two vectors represents the similarity between the two documents (Poshyvanyk, 2006). The work of Deerwester differs from simple vector space models by taking synonymy and polysemy into account. Polysemy is defined as a single word carrying more than one meaning (Deerwester, 1999). For example, orange could be considered a fruit in one document and a color in another. Synonymy, on the other hand, is the ability to refer to a concept with more than one word. Auto and car are examples: both can be used to point to one meaning, the automobile. Although LSI was presented as supporting both synonymy and polysemy, in reality it performs worse on polysemy than on synonymy (Lukins, 2010). Reaching a satisfactory topic set is another problem with LSA. To address these problems, (Hofmann, 1999) proposed probabilistic Latent Semantic Analysis, pLSA.

2.2. LSI/LSA to pLSA

In pLSA, each term in a document is modeled as a sample from a mixture of multinomial variables. The model is built on the probability distribution of terms in each document. With this construction, pLSA shows improvements over LSA (Hofmann, 1999) (Blei, Ng and Jordan, 2003). pLSA gives promising results, but its model is document-oriented: the number of parameters grows linearly with the number of documents. This introduces an overfitting problem (Girolami, 2003). The problem appears when new documents are introduced to a previously trained model: pLSA tends to find topics in new documents according to the distributions estimated for the documents it was trained on (Blei, 2003) (Wei, 2006). LDA, which is similar to pLSA, was introduced by Blei in 2003 to solve this problem.
(Blei, 2003) Studies comparing [20] the results of pLSA and LDA on the same corpora gave clear results in support of LDA.

2.3. pLSA to LDA

Latent Dirichlet Allocation (LDA) provides a means of fitting the Dirichlet parameters to a given document set. In LDA, every document is modeled as a finite mixture over a set of topics, and each word in the document is drawn from one of those topics (Blei, 2003) (Zheng, 2006). In this probabilistic model, word, document and corpus are the main concepts. In (Blei, 2003) these concepts are defined as follows:

• A word is the basic unit, defined as an item from a dictionary indexed from 1 to V.
• A document is a sequence of N words, represented by w = (w_1, w_2, …, w_N).
• A corpus is a set of M documents, represented by D = {w_1, w_2, …, w_M}.

Figure-1: Graphical Model Representation of LDA
Source: Blei and Jordan, 2003

In Figure 1, the boxes represent plates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. LDA assumes a generative process for each document w in a corpus D. (...truncated)