An answer summarization method based on keyword extraction
BIO Web of Conferences
Qiaoqing Fan and Yu Fang
Department of Computer Science, Tongji University, Shanghai, China
In order to reduce the redundancy of answer summaries generated from community Q&A datasets without topic tags, we propose an answer summarization algorithm based on keyword extraction. We combine tf-idf with word vectors to modify the influence transfer ratio equation in TextRank. Then, during summarization, we take the ratio of the number of sentences containing any keyword to the total number of candidate sentences as an adaptive factor for AMMR. Meanwhile, we reuse the keyword scores produced by TextRank as a weight factor when computing sentence similarity. Experimental results show that the proposed answer summarization method outperforms the traditional MMR and AMMR.
ICMSB2016

1 Introduction
2 Related works
Research on keyword extraction has been ongoing for years. Mihalcea et al. proposed TextRank
based on PageRank [1]. On the basis of TextRank, Li et al. [3] explored the use of tags to
improve the webpage keyword extraction task, and Xia et al. [4] used the influence of word
location to construct the word graph and calculate the probability transition matrix. In addition,
there are some machine learning methods for keyword extraction, typically LDA [5]. The biggest
difference between TextRank and LDA is that TextRank does not require a training corpus in
advance. Although LDA analyses the linkages between words at the semantic level, the model
must be retrained when a new corpus arrives, so TextRank is simpler to use. However, TextRank
selects only lexical units of certain parts of speech to construct a word graph in which influence
is transferred evenly. According to the Matthew Effect, a word should receive more attention from
neighbouring synonyms to highlight its own importance, and the same holds for high-frequency
words. Therefore, this paper takes term frequency and word sense into account for the keyword
extraction task.
Summarization is divided into two types according to the number of documents to be summarized:
Single Document Summarization and Multiple Document Summarization. Both extractive and
abstractive approaches are used [6]. The research object of this paper is extractive
multiple-document summarization.
Answer summarization was first proposed by Liu et al. [7]. Yin et al. developed a hierarchical
clustering method to group similar questions and a ranking-based summarization to represent an
answer [8]. Wang et al. proposed a topic-centric answer summarization method called Adaptive
Maximum Marginal Relevance (AMMR) in 2013 [2]. AMMR takes e-mail headers and webpage
tags as topic information to automatically adjust the weighting between topic relevance and
redundancy when picking relevant sentences from candidate answers. However, not all QA datasets
are labelled with such tags or have headers. Therefore, this paper proposes a
keyword-extraction-centric answer summarization method based on Wang's work.
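As background, the classic MMR criterion that AMMR extends greedily balances relevance to a topic against redundancy with already-selected sentences. The sketch below is a minimal pure-Python illustration, not the paper's exact formulation: the fixed trade-off weight `lam` and the toy word-overlap similarity `bow_sim` are illustrative stand-ins (AMMR replaces the fixed weight with an adaptive factor derived from topic information).

```python
def mmr_select(candidates, query, sim, k=3, lam=0.7):
    """Greedy Maximum Marginal Relevance: pick k sentences that are
    relevant to the query but not redundant with already-picked ones."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(s):
            relevance = sim(s, query)
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

def bow_sim(a, b):
    """Toy Jaccard word-overlap similarity, standing in for a real measure."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

For example, given two near-duplicate cat sentences and one off-topic sentence, the second pick is penalized for overlapping with the first, so the off-topic sentence only wins if the trade-off weight is pushed toward diversity.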
3 Keyword extraction
TextRank is derived from PageRank [9]. It splits text into a number of units; in keyword
extraction the units are words. These words are added to a graph as nodes, and an edge is added
between two words that co-occur within a window of words. After the graph is constructed, the
ranking algorithm described in [1] is run on the graph for several iterations until it converges.
Once a final score is obtained for each unit, the top units with the highest scores are selected
as keywords.
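The pipeline above can be sketched in pure Python. The window size, damping factor and iteration count below are illustrative defaults, not the paper's tuned values, and the influence here is still transferred evenly, as in the original TextRank:

```python
from collections import defaultdict

def textrank_keywords(words, window=4, d=0.85, iters=30, top_k=3):
    """Build a co-occurrence graph over the token list and run the
    PageRank-style update; influence is split evenly among neighbours."""
    # Undirected co-occurrence edges within a sliding window.
    neighbours = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbours[words[i]].add(words[j])
                neighbours[words[j]].add(words[i])
    score = {w: 1.0 for w in neighbours}
    for _ in range(iters):
        new = {}
        for w in score:
            # Each neighbour splits its own score evenly among its links.
            incoming = sum(score[u] / len(neighbours[u]) for u in neighbours[w])
            new[w] = (1 - d) + d * incoming
        score = new
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

On a toy token sequence such as `"a b a c a d".split()`, the hub word that co-occurs with everything ends up with the highest score.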
However, TextRank only constructs a word graph in which influence is transferred evenly. Suppose
Figure 1 is the word graph of a certain document. In TextRank, if a node is linked with three
nodes, it contributes exactly one third of its influence to each of them. In fact, one word
should transfer more influence to another word that is closer to it in semantic space, so we
need a method to evaluate the semantic relationship between these words.
Matthew effect: https://en.wikipedia.org/wiki/Matthew_effect
[Figure 1. Example word graph with nodes b, f and g]
Word2vec provides such an evaluation method. It was proposed by Mikolov et al. in 2013
to compute continuous vector representations of words from very large datasets [10,11]. The
resulting vectors can be used to compute semantic similarity, since word2vec captures the
context of a word during training.
Formally, let $W$ be the word set of document $D$, and $|W|$ the number of words in $D$. Let
$K$ be the word set extracted from $D$, including nouns, verbs and words in the user dictionary,
and $|K|$ the number of words in $K$. Each word in $K$ can be represented as an $n$-dimensional
vector after training the corpus with word2vec. Let $v_i$ denote the vector of word $w_i$. Then
the semantic similarity of $w_i$ and $w_j$ can be measured by cosine similarity:

$$sim(w_i, w_j) = \cos(v_i, v_j) = \frac{v_i \cdot v_j}{\|v_i\|\,\|v_j\|}$$
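Concretely, the cosine measure can be computed in a few lines of pure Python. The sketch assumes the two word vectors have already been obtained from a trained word2vec model; the example vectors used here are made up for illustration:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # guard against zero vectors
    return dot / (norm_u * norm_v)
```

Parallel vectors score 1.0, orthogonal vectors score 0.0, so the measure can be used directly as the edge weight discussed next.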
The influence transfer ratio equation in TextRank is then re-weighted by the semantic similarity
and re-normalized as shown below:

$$S(w_i) = (1 - d) + d \sum_{w_j \in In(w_i)} \frac{sim(w_j, w_i)}{\sum_{w_k \in Out(w_j)} sim(w_j, w_k)} S(w_j)$$

Let $Out(w_i)$ be the index set of words that $w_i$ points to (successors) in the word graph,
and $S(w_i)$ the influence score of $w_i$. In ad