Leveraging semantic resources in diversified query expansion


World Wide Web
DOI 10.1007/s11280-017-0468-7

Leveraging semantic resources in diversified query expansion

Adit Krishnan¹ · Deepak P.² · Sayan Ranu³ · Sameep Mehta⁴

Received: 13 February 2017 / Revised: 25 April 2017 / Accepted: 9 May 2017
© The Author(s) 2017. This article is an open access publication

This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering
Guest Editors: Wojciech Cellary, Hua Wang, and Yanchun Zhang

¹ Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
² Queen’s University Belfast, Northern Ireland, UK
³ Department of Computer Science and Engineering, IIT Delhi, Hauz Khas New Delhi, 110016, India
⁴ IBM-Research, New Delhi, 110070, India

Abstract A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever her intended interpretation. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. In this paper, we consider the usage of semantic resources and tools to arrive at improved methods for diversified query expansion. In particular, we develop two methods that leverage Wikipedia and pre-learnt distributional word embeddings, respectively. Both approaches operate on a common three-phase framework: first selecting a set of informative terms from the search results of the initial query, then building a graph, and finally using a diversity-conscious node ranking to prioritize candidate terms for diversified query expansion.
Our methods differ in the second phase: the first method, Select-Link-Rank (SLR), links terms with Wikipedia entities to accomplish graph construction, whereas the second method, Select-Embed-Rank (SER), constructs the graph using similarities between distributional word embeddings. Through an empirical analysis and user study, we show that SLR outperforms state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, asserting that it is the best available method for those cases where SLR is not applicable; these include narrow-focus search systems where a relevant knowledge base is unavailable. Our SLR method is also seen to outperform a state-of-the-art method in the task of diversified entity ranking.

Keywords Query expansion · Diversification · Semantic search · Wikipedia · Entity ranking

1 Introduction

Users of a search system may choose the same initial search query for varying information needs. This is most evident in the case of ambiguous queries, which are estimated to make up one-sixth of all queries [30]. Consider the example of a user searching with the query python. This is a perfectly reasonable starting query for a zoologist interested in learning about the species of large non-venomous reptiles,¹ or for a comedy enthusiast interested in learning about the British comedy group Monty Python.² However, search results would most likely be dominated by pages relating to the programming language,³ that being the dominant interpretation (aka aspect) on the Web. Search Result Diversification (SRD) [5, 37] refers to the task of selecting and/or re-ranking search results so that many aspects of the query are covered in the top results; this would ensure that the zoologist and the comedy fan in our example are not disappointed with the results.
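The re-ranking flavour of SRD described above can be illustrated with a toy sketch. This is not the method of [5, 37] or of this paper; it is a simple greedy aspect-coverage heuristic written only for illustration. The function name `diversify`, the `depth` parameter, the document identifiers and the hand-assigned aspect labels are all assumptions introduced here.

```python
from typing import List, Set, Tuple

# (doc_id, aspects covered), listed in un-diversified relevance order.
# Hypothetical toy data; real systems must estimate aspects somehow.
RankedDoc = Tuple[str, Set[str]]


def diversify(ranked: List[RankedDoc], k: int, depth: int) -> List[str]:
    """Greedily pick up to k documents from the top `depth` results,
    preferring documents that cover aspects not yet covered."""
    pool = list(ranked[:depth])  # only this prefix is ever considered
    covered: Set[str] = set()
    chosen: List[str] = []
    while pool and len(chosen) < k:
        # max() returns the first maximal element, so ties are broken
        # in favour of the original relevance order.
        best = max(pool, key=lambda d: len(d[1] - covered))
        pool.remove(best)
        covered |= best[1]
        chosen.append(best[0])
    return chosen


ranked = [
    ("d1", {"programming"}),
    ("d2", {"programming"}),
    ("d3", {"programming"}),
    ("d4", {"reptile"}),
    ("d5", {"programming"}),
    ("d6", {"comedy"}),
]

shallow = diversify(ranked, k=3, depth=4)
deep = diversify(ranked, k=3, depth=6)
print(shallow)
print(deep)
```

In this toy list the only comedy-related document sits at rank six, so a re-ranker that inspects just the top four results cannot surface it, while inspecting the top six lets all three aspects appear in the top three.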
If the British group is to be covered among the top results in a re-ranking based SRD approach for our example, the approach should consider documents that are as deep in the un-diversified ranked list as the rank of the first result that relates to the group. In our exploration, we could not find a result relating to Monty Python among the first five pages of search results for python on Bing. Such difficulties in covering long-tail aspects, as noted in [2], led to research interest in a slightly different task attacking the same larger goal, that of Diversified Query Expansion (DQE).

Note that techniques to ensure coverage of diverse aspects among the top results are relevant for apparently unambiguous queries too, though the need is more pronounced in inherently ambiguous ones. For the apparently unambiguous query python programming, there are many aspects based on whether the user is interested in books, software or courses. Similarly, for another seemingly unambiguous query, india, the aspects of interest could include railways, maps, news and cricket.

¹ https://en.wikipedia.org/wiki/Pythonidae
² https://en.wikipedia.org/wiki/Monty_Python
³ https://en.wikipedia.org/wiki/Python_(programming_language)

DQE is the task of identifying a (small) set of terms (i.e., words) to extend the search query with, wherein the extended search query could be used in the search system to retrieve results covering a diverse set of aspects. For our python example, desirable top DQE expansion terms would include those relating to the programming language aspect, such as language and programming, as well as those relating to the reptile aspect, such as pythonidae and reptile. In existing work, the extension terms have been identified from sources such as corpus documents [34], query logs [21], external ontologies [2, 3] or the results of the initial query [34].
The aspect-affinity of each term is modeled either explicitly [21, 34] or implicitly [2], followed by selection of a subset of candidate words using the Maximum Marginal Relevance (MMR) principle [5]. This ensures that terms related to many aspects find a place in the extended set. Diversified Entity Recommendations (DER) is the analogous problem where the output of interest is a ranked list of entities from a knowledge base such that diverse query aspects are covered among the top entities.

In this paper, we consider the diversified query expansion problem and develop a three-phase framework to exploit semantic resources for the problem. We use the framework to develop methods focusing on Wikipedia and pre-learned word embeddings respectively, leading to techniques that we call Select-Link-Rank (SLR) and Select-Embed-Rank (SER). Further, we outline how SLR can address diversified entity ranking, and illustrate that SER results can also be mapped to a corresponding DER result set.

Extension from WISE 2016 Paper

In our WISE 2016 paper [18], we proposed the SLR method. In this paper, we generalize SLR into a framework, and also develop another method based on the framework, SER, one targeted at exploiting pre- (...truncated)