Leveraging semantic resources in diversified query expansion
World Wide Web
DOI 10.1007/s11280-017-0468-7
Adit Krishnan1 · Deepak P.2 · Sayan Ranu3 ·
Sameep Mehta4
Received: 13 February 2017 / Revised: 25 April 2017 / Accepted: 9 May 2017
© The Author(s) 2017. This article is an open access publication
Abstract A search query, being a very concise grounding of user intent, could potentially
have many possible interpretations. Search engines hedge their bets by diversifying top
results to cover multiple such possibilities so that the user is likely to be satisfied, whatever
her intended interpretation. Diversified Query Expansion is the problem of diversifying query expansion suggestions so that the user can specialize the query to better suit her
intent even before perusing search results. In this paper, we consider the use of semantic resources and tools to arrive at improved methods for diversified query expansion. In
particular, we develop two methods that leverage Wikipedia and pre-learnt distributional word embeddings, respectively. Both approaches operate on a common three-phase
framework: first selecting a set of informative terms from the search results of the initial query, then building a graph, and finally using a diversity-conscious node ranking to
This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering
Guest Editors: Wojciech Cellary, Hua Wang, and Yanchun Zhang
Deepak P.
Adit Krishnan
Sayan Ranu
Sameep Mehta
1 Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2 Queen’s University Belfast, Northern Ireland, UK
3 Department of Computer Science and Engineering, IIT Delhi, Hauz Khas, New Delhi, 110016, India
4 IBM-Research, New Delhi, 110070, India
prioritize candidate terms for diversified query expansion. Our methods differ in the second phase: the first method, Select-Link-Rank (SLR), links terms to Wikipedia
entities to accomplish graph construction, whereas the second method, Select-Embed-Rank (SER), constructs the graph using similarities between distributional word
embeddings. Through an empirical analysis and user study, we show that SLR outperforms
state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is
an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, indicating that it is the best available
method for cases where SLR is not applicable; these include narrow-focus search
systems where a relevant knowledge base is unavailable. Our SLR method is also seen to
outperform a state-of-the-art method in the task of diversified entity ranking.
Keywords Query expansion · Diversification · Semantic search · Wikipedia · Entity
ranking
1 Introduction
Users of a search system may choose the same initial search query for varying information needs. This is most evident in the case of ambiguous queries that are estimated to
make up one-sixth of all queries [30]. Consider the example of a user searching with the
query python. It may be observed that this is a perfectly reasonable starting query for a
zoologist interested in learning about the species of large non-venomous reptiles,1 or for a
comedy-enthusiast interested in learning about the British comedy group Monty Python.2
However, search results would most likely be dominated by pages relating to the programming
language,3 that being the dominant interpretation (aka aspect) on the Web. Search Result
Diversification (SRD) [5, 37] refers to the task of selecting and/or re-ranking search results
so that many aspects of the query are covered in the top results; this would ensure that
the zoologist and comedy-fan in our example are not disappointed with the results. If the
British group is to be covered among the top results in a re-ranking based SRD approach for
our example, the approach should consider documents that are as deep in the un-diversified
ranked list as the rank of the first result that relates to the group. In our exploration, we
could not find a result relating to Monty Python among the first five pages of search results
for python on Bing. Such difficulties in covering long tail aspects, as noted in [2], led to
research interest in a slightly different task attacking the same larger goal, that of Diversified
Query Expansion (DQE). Note that techniques to ensure coverage of diverse aspects among
the top results are relevant for apparently unambiguous queries too, though the need is more
pronounced in inherently ambiguous ones. For an apparently unambiguous query such as python programming, there are many aspects based on whether the user is interested in books, software or
courses. Similarly, for another seemingly unambiguous query, india, the aspects of interest
could include railways, maps, news and cricket.
DQE is the task of identifying a (small) set of terms (i.e., words) to extend the search
query with, wherein the extended search query could be used in the search system to
retrieve results covering a diverse set of aspects. For our python example, desirable top DQE
1 https://en.wikipedia.org/wiki/Pythonidae
2 https://en.wikipedia.org/wiki/Monty_Python
3 https://en.wikipedia.org/wiki/Python_(programming_language)
expansion terms would include those relating to the programming language aspect such as
language and programming as well as those relating to the reptile-aspect such as pythonidae
and reptile. In existing work, the extension terms have been identified from sources such as
corpus documents [34], query logs [21], external ontologies [2, 3] or the results of the initial
query [34]. The aspect-affinity of each term is modeled either explicitly [21, 34] or implicitly [2], followed by the selection of a subset of candidate words using the Maximum Marginal
Relevance (MMR) principle [5]. This ensures that terms related to many aspects find a place
in the extended set. Diversified Entity Recommendations (DER) is the analogous problem
where the output of interest is a ranked list of entities from a knowledge base such that
diverse query aspects are covered among the top entities.
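To illustrate the MMR principle mentioned above, the following is a minimal sketch of greedy MMR selection over candidate expansion terms. It is not the implementation used in any of the cited works: the relevance scores, the aspect labels, the aspect-based similarity function and the trade-off parameter lam are all toy assumptions introduced purely for illustration.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedily pick k terms, trading relevance against redundancy (MMR)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(term):
            # Redundancy is the highest similarity to any already selected term.
            redundancy = max((similarity(term, s) for s in selected), default=0.0)
            return lam * relevance[term] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): hypothetical relevance scores and aspect labels
# for the "python" example query.
relevance = {"programming": 0.9, "language": 0.8, "reptile": 0.5, "pythonidae": 0.4}
aspect = {"programming": "software", "language": "software",
          "reptile": "zoology", "pythonidae": "zoology"}


def same_aspect(a, b):
    # Assumed similarity: 1.0 if two terms share an aspect, else 0.0.
    return 1.0 if aspect[a] == aspect[b] else 0.0


picked = mmr_select(list(relevance), relevance, same_aspect, k=2)
print(picked)
```

With these toy scores, the second pick is a zoology term even though language has higher relevance, because language is redundant with the already selected programming.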
In this paper, we consider the diversified query expansion problem and develop a three-phase
framework to exploit semantic resources for the problem. We use the framework to
develop methods focusing on Wikipedia and pre-learnt word embeddings, respectively,
leading to techniques that we call Select-Link-Rank (SLR) and Select-Embed-Rank (SER).
Further, we outline how SLR can address diversified entity ranking, and illustrate that SER
results can also be mapped to a corresponding DER result set.
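As a rough illustration of how a term graph might be built from embedding similarities in the second phase of an SER-style method, consider the sketch below. It is only a sketch under stated assumptions: the two-dimensional vectors, the use of cosine similarity and the edge threshold of 0.7 are hypothetical choices made for this example, and the actual graph construction used by SER may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_similarity_graph(embeddings, threshold=0.7):
    """Link two terms with a weighted, undirected edge when their
    embedding similarity reaches the (assumed) threshold."""
    terms = sorted(embeddings)
    graph = {t: {} for t in terms}
    for i, s in enumerate(terms):
        for t in terms[i + 1:]:
            weight = cosine(embeddings[s], embeddings[t])
            if weight >= threshold:
                graph[s][t] = weight
                graph[t][s] = weight
    return graph


# Toy 2-d vectors (assumed); real pre-learnt embeddings have far more dimensions.
toy_embeddings = {
    "programming": [1.0, 0.1],
    "language": [0.9, 0.2],
    "reptile": [0.1, 1.0],
    "pythonidae": [0.2, 0.9],
}

graph = build_similarity_graph(toy_embeddings)
for term, neighbours in graph.items():
    print(term, sorted(neighbours))
```

In this toy graph, terms from the same aspect end up connected while terms from different aspects do not, which is the kind of structure a diversity-conscious node ranking could then exploit.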
Extension from WISE 2016 Paper. In our WISE 2016 paper [18], we proposed
the SLR method. In this paper, we generalize SLR into a framework, and also develop
another method based on the framework, SER, one targeted at exploiting pre- (...truncated)