An answer summarization method based on keyword extraction

BIO Web of Conferences, Jan 2017

In order to reduce the redundancy of answer summaries generated from community Q&A datasets without topic tags, we propose an answer summarization algorithm based on keyword extraction. We combine TF-IDF with word vectors to modify the influence-transfer ratio equation in TextRank. During summarization, we take the ratio of the number of sentences containing any keyword to the total number of candidate sentences as an adaptive factor for AMMR. We also reuse the keyword scores produced by TextRank as weight factors when computing sentence similarity. Experimental results show that the proposed answer summarization method outperforms the traditional MMR and AMMR.

Full-text PDF:

https://www.bio-conferences.org/articles/bioconf/pdf/2017/01/bioconf_icmsb2017_03015.pdf


Qiaoqing Fan, Yu Fang
Department of Computer Science, Tongji University, Shanghai, China

1 Introduction

2 Related works

Research on keyword extraction has been ongoing for years. Mihalcea et al. [1] proposed TextRank based on PageRank. Building on TextRank, Li et al. [3] explored the use of tags to improve webpage keyword extraction, and Xia et al. [4] used word location to construct the word graph and compute the probability transition matrix. In addition, there are machine learning methods for keyword extraction, typically LDA [5]. The biggest difference between TextRank and LDA is that TextRank does not require a training corpus in advance. Although LDA analyses the linkages between words at the semantic level, the model must be retrained whenever a new corpus arrives, so TextRank is simpler to use. TextRank selects only lexical units of certain parts of speech to construct a word graph in which influence is transferred evenly. According to the Matthew effect, a word should receive more attention from neighbouring synonyms to highlight its own importance, and so should high-frequency words. Therefore, this paper takes term frequency and word sense into account for the keyword extraction task.
Summarization is divided into two categories according to the number of documents to be summarized: single-document summarization and multi-document summarization. Both extractive and abstractive approaches are used [6]. The research object of this paper is extractive multi-document summarization. Answer summarization was first proposed by Liu et al. [7]. Yin et al. developed a hierarchical clustering method to group similar questions and a ranking-based summarization to represent an answer [8]. Wang et al. proposed a topic-centric answer summarization method called Adaptive Maximum Marginal Relevance (AMMR) in 2013 [2]. AMMR takes e-mail headers and webpage tags as topic information to automatically adjust the weighting of topic relevance and redundancy when picking relevant sentences from candidate answers. However, not all QA datasets are labelled with such tags or have headers. Therefore, this paper proposes a keyword-extraction-centric answer summarization method based on Wang's work.

3 Keyword extraction

TextRank is derived from PageRank [9]. It splits the text into a number of units; in keyword extraction the units are words. All these words are added to a graph, and an edge is added between any two words that co-occur within a window of words. After the graph is constructed, the ranking algorithm described in [1] is run on the graph for several iterations until it converges. Once a final score is obtained for each unit, the top units with the highest scores are selected as keywords. However, TextRank constructs a word graph in which influence is transferred evenly: as in the word graph of Figure 1, the influence one word contributes to a linked word is its own score divided by the number of its neighbours, e.g. one third when three nodes are linked to it. In fact, one word should transfer more influence to another word that is closer to it in semantic space.
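TextRank's graph construction and iterative scoring can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the window size, the damping factor d = 0.85, and all function names are assumptions:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iters=50, top_k=5):
    """Plain TextRank: evenly-transferred influence over a co-occurrence graph."""
    # Build an undirected co-occurrence graph: an edge links two words
    # that appear within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:  # skip self-loops
                neighbours[w].add(words[j])
                neighbours[words[j]].add(w)

    # PageRank-style iteration: each node splits its influence
    # evenly among its neighbours (the behaviour the paper modifies).
    score = {w: 1.0 for w in neighbours}
    for _ in range(iters):
        new = {}
        for w in neighbours:
            incoming = sum(score[u] / len(neighbours[u]) for u in neighbours[w])
            new[w] = (1 - d) + d * incoming
        score = new
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

For example, in the toy sequence `"a b a c a d"` the word `a` co-occurs with every other word, gathers the most transferred influence, and is ranked first.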
So we need a method to evaluate the semantic relationship between these words. Word2vec provides such an evaluation method. It was proposed by Mikolov et al. in 2013 to compute continuous vector representations of words from very large datasets [10,11]. The vectors can be used to compute semantic similarity, since word2vec captures the context of a word during training. Formally, let W be the word set of document d and n the number of words in W. Let K be the word set extracted from W, including nouns, verbs and words in the user dictionary, and m the number of words in K. Each word w in W can be represented as a k-dimensional vector after training the corpus with word2vec; let v(w) denote the vector of w. Then the semantic similarity of two words w_i and w_j can be measured by cosine similarity:

    sim(w_i, w_j) = (v(w_i) · v(w_j)) / (||v(w_i)|| ||v(w_j)||)

The influence-transfer ratio equation in TextRank is then re-weighted by this semantic similarity and re-normalized. Let Out(w_i) be the index set of the words that w_i points to (its successors) in the word graph, and let S(w_i) be the influence score of w_i. In ad (...truncated)
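The re-weighting step above can be sketched as follows: compute the cosine similarity between a word's vector and each successor's vector, then normalize so the transfer ratios still sum to one. This is a sketch under stated assumptions, not the paper's code; the function names and the fallback to an even split when all similarities are zero are my own choices:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def transfer_ratios(word, successors, vec):
    """Replace TextRank's even 1/|successors| split with ratios
    proportional to semantic similarity, re-normalized to sum to 1."""
    sims = {s: cosine(vec[word], vec[s]) for s in successors}
    total = sum(sims.values())
    if total == 0:  # fallback: original evenly-transferred behaviour
        return {s: 1.0 / len(successors) for s in successors}
    return {s: sims[s] / total for s in successors}
```

With toy 2-d vectors where "dog" is identical to "cat" and "car" is orthogonal, all of "cat"'s influence flows to "dog" instead of being split evenly.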



Qiaoqing Fan, Yu Fang. An answer summarization method based on keyword extraction, BIO Web of Conferences, 2017, 8, DOI: 10.1051/bioconf/20170803015