TOPIC MODEL IMPLEMENTATION TO FIND RELATED DOCUMENTS IN CORPORATE ARCHIVES IN REAL LIFE: “A CASE SCENARIO ON KNOWLEDGE RETRIEVAL”
INTERNATIONAL JOURNAL OF eBUSINESS AND eGOVERNMENT STUDIES
Vol 5, No 1, 2013 ISSN: 2146-0744 (Online)
İhsan Tolga Medeni
Çankaya University, METU
Specialist, PhD Student
Tunç Durmuş Medeni
Yıldırım Beyazıt University, Turksat, METU, Çankaya University
Instructor, Senior Specialist
─Abstract ─
Today’s organizations are built largely on their documents, which are crucial
sources of knowledge. Even when an organization knows that these documents exist,
it is often nearly impossible to extract the knowledge captured inside them. Under
these conditions, organizations choose to prepare the same document again rather
than find the proper documents in their archives. Finding these documents,
however, would save precious time and reduce redundant work. The topic model idea
focuses on extracting knowledge from these types of documents. In this study, our
aim is to summarize topic model research and to explain the latest model concept
through an imaginary case scenario.
Key Words: Topic Model, Knowledge Extraction, Latent Semantic Analysis
(LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation
(LDA)
JEL Classification: C60, D83
1. INTRODUCTION
When an unexpected condition occurs in an organization, a solution has to be
found within the organization's workflow, and this creates a high volume of
documents. In reality, the organization has probably faced a similar problem
before, and a set of information and documents for resolving the condition
already exists. As pointed out in Davenport and Prusak’s Working Knowledge, if
the organization had a proper knowledge structure independent of individual
workers’ experience, which can easily be lost when a worker's career in the
organization ends, it would not need to repeat the same jobs again and again
(Davenport, 2001). One solution to this problem is the topic model concept.
The topic modeling concept was defined to meet the need to extract information
without involving any user queries (Deerwester, 1999). Topic models allow
documents to be represented as collections of topics rather than collections of
words (Zheng, 2009). Work on this concept started in the 1990s with the vector
space model and continued with latent semantic analysis/indexing (LSA, LSI),
probabilistic LSA (pLSA) and Latent Dirichlet Allocation (LDA). In research areas
ranging from medical science to software engineering, and in applications from
text mining to image processing, topic models have been used under different
names such as Information Retrieval (IR), dimensionality reduction and word
matching.
This paper first gives a general summary of these methods. It then presents a
possible application scenario from an organizational perspective, and it ends
with the conclusion.
2. TOPIC MODEL EVOLUTION
2.1. Vector Space model to LSI/LSA
In 1990, Deerwester and his colleagues proposed an approach for automatic
indexing and retrieval (Deerwester, 1999). According to them, the existing
techniques based on user queries simply matched the words of the user queries.
However, this approach does not include any evidential information about the
meaning and concept of the document and topic. Taking a statistical approach,
latent semantic indexing (LSI), also called latent semantic analysis (LSA),
builds a semantic space that shows the association between terms and documents.
This concept was built on the vector space model, in which text documents are
represented as vectors of terms and the relationships between documents and terms
are represented in a matrix. The cosine of the angle between two vectors
represents the similarity between the two documents (Poshyvanyk, 2006). The work
of Deerwester differs from simple vector space models by taking synonymy and
polysemy into account. Polysemy is defined as a single word carrying more than
one meaning (Deerwester, 1999). For example, orange could be a fruit in one
document and a color in another. Synonymy, on the other hand, is the ability to
refer to a concept with more than one word. The words auto and car are an
example: both can be used to point to one meaning, the automobile. Although LSI
was presented as addressing both synonymy and polysemy, in practice it handles
polysemy less well than synonymy (Lukins, 2010). Reaching a satisfactory topic
set is another problem with LSA.
To address these problems, Hofmann (1999) proposed probabilistic Latent
Semantic Analysis (pLSA).
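The vector space representation and the LSA step described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: the three toy documents, the variable names and the choice of k = 2 latent dimensions are all assumptions made for the example, and NumPy's SVD stands in for whatever decomposition routine a real system would use.

```python
import numpy as np

# Toy corpus (hypothetical): documents 0 and 1 are about cars, document 2 is about fruit.
docs = [
    "auto engine repair",
    "car engine repair",
    "orange fruit juice",
]

# Build the vocabulary and the term-document count matrix A (terms x documents).
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[index[word], j] += 1.0


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Vector space model: compare documents by their raw term vectors.
raw_sim_cars = cosine(A[:, 0], A[:, 1])
raw_sim_mixed = cosine(A[:, 0], A[:, 2])

# LSA: a truncated SVD keeps only the k strongest latent dimensions.
k = 2  # assumed value, chosen for this toy corpus
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dimensional row per document

lsa_sim_cars = cosine(doc_vectors[0], doc_vectors[1])
lsa_sim_mixed = cosine(doc_vectors[0], doc_vectors[2])

print(raw_sim_cars, lsa_sim_cars)
print(raw_sim_mixed, lsa_sim_mixed)
```

In the raw term space the two car documents differ on the words auto and car, so their cosine is below 1. In the reduced space the two documents collapse onto the same latent dimension, which is how LSA is meant to soften the synonymy problem.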
2.2. LSI/LSA to pLSA
In pLSA, each term is modeled over a set of multinomial variables according to
its related documents. The model is built on the probabilistic distribution of
terms in each document. With this construct, pLSA shows improvements over
LSA (Hofmann, 1999) (Blei, Ng and Jordan, 2003).
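As a compact reminder of the model behind this description, the pLSA mixture is commonly written in the form below. The symbols are not used in the text above and are introduced here only for illustration: d is a document, w a term, and z a latent topic drawn from a topic set Z.

```latex
P(d, w) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)
```

Here P(w | z) is a multinomial distribution over terms for each topic, and P(z | d) is a separate set of mixing weights for each training document.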
pLSA shows promising results with its document-oriented model, which grows
linearly with the number of documents. However, this model also introduced an
overfitting problem (Girolami, 2003). The problem occurs when new documents are
introduced to a previously trained structure: pLSA tends to find topics in new
documents according to the distributions estimated from the earlier documents
(Blei, 2003) (Wei, 2006). LDA, a model similar to pLSA, was introduced by Blei in
2003 to solve this problem (Blei, 2003). Studies comparing [20] the results of
pLSA and LDA on the same corpora clearly support the LDA implementation.
2.3. pLSA to LDA
Latent Dirichlet Allocation provides a means of fitting the Dirichlet parameter
to a given document set.
In LDA, every document is treated as a finite mixture over a set of multiple
topics, from which its words are drawn (Blei, 2003) (Zheng, 2006). In this
probabilistic model, word, document and corpus are the main concepts. In
(Blei, 2003) these concepts are defined as follows:
• A word is the basic unit, defined as an item from a dictionary indexed
from 1 to V.
• A document is a sequence of N words, represented by w = (w1, w2, …, wN).
• A corpus is a collection of M documents, represented by D = {w1, w2, …, wM},
where each element is a document.
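The three definitions above map directly onto simple data structures. The sketch below is only an illustration under assumed names: the five-word dictionary, the encode helper and the two sample texts are hypothetical and do not come from (Blei, 2003).

```python
# Hypothetical toy dictionary; word indices run from 1 to V as in the definitions above.
dictionary = ["archive", "document", "knowledge", "report", "topic"]
V = len(dictionary)
word_id = {word: i for i, word in enumerate(dictionary, start=1)}


def encode(text):
    """Turn a text into a document: the sequence of its word indices.

    Tokens that are not in the dictionary are skipped.
    """
    return [word_id[token] for token in text.lower().split() if token in word_id]


# A corpus is a collection of M documents.
corpus = [
    encode("knowledge report archive"),
    encode("topic document topic"),
]
M = len(corpus)
N = [len(doc) for doc in corpus]  # number of words in each document

print(corpus)
```

Representing words by their dictionary index keeps each document a plain sequence of integers, which is the form most topic model implementations expect as input.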
Figure-1: Graphical Model Representation of LDA
Source: Blei and Jordan, 2003
In Figure 1, the boxes are plates, which represent repeated parts of the model.
The outer plate represents documents, while the inner plate represents the
repeated choice of topics and words within a document. LDA follows the
generative process for each document w in a corpus
D.