MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation
Regular Issue
Zemani Imene Mansouria*, Zekri Lougmiri*, Senouci Mohamed
Department of Computer Sciences, LAPECI Laboratory, Ahmed Ben Bella Oran1 University (Algeria)
Received 22 October 2018 | Accepted 22 March 2019 | Published 10 April 2019
Abstract
Current information systems are very large and maintain huge amounts of data. At any moment,
they process millions of documents and millions of queries. To select the most important
responses from this volume of data, it is advisable to apply so-called early termination
algorithms. These attempt to extract the Top-k documents according to a specified monotonically
increasing function. The principal idea is to reach and score only the smallest possible number
of significant documents, thereby avoiding full processing of the whole collection. The WAND
algorithm is the state of the art in this area. Although it is efficient, it lacks effectiveness
and precision. In this paper, we propose two contributions. The principal one is a new early
termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is
faster and more precise than WAND, and it is able to avoid unnecessary WAND steps. To this end,
we integrate a tree structure as an index into WAND and add new levels to query processing. As a
second contribution, we define new fine-grained metrics to improve the evaluation of the
retrieved information. Experimental results on real datasets show that MWAND is more efficient
than the WAND approach.
Keywords
Evaluation Measures, Information Retrieval, Large Inverted List, MWAND, Query Processing,
Top-k, WAND.
I. Introduction
Practical web search engines are very complex systems whose goal is to return fast and
precise results; they must be both effective and efficient. These search engines use query
processing techniques and algorithms, such as the WAND algorithm, to return a ranked set of
result documents called the Top-k. These algorithms run on a data structure called the
inverted index [1]. For every term, this structure gives the set of documents in which the
term appears, together with additional information such as the term frequency (TF), the list
of positions in every document, and the format and size in which the term is written. Such a
construction generates a very large index; in fact, its size is larger than that of the
original document set. As a consequence, traversing this index becomes the major bottleneck
in query processing. Sweeping all posting lists is impractical, if not impossible. An early
termination algorithm is therefore recommended in this situation: it can return the exact
Top-k without scanning the entire posting lists. Note that a posting list is the part of the
inverted index loaded into memory for processing. Note also that the lists are sorted either
in ascending order of document numbers [2][3] or in descending order of TF [4][5]; the choice
is left to the algorithm designer. To reduce the size of the information stored in the
posting lists, a set of compression techniques has been proposed [6][7][8].
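The inverted index described above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions: whitespace tokenization, an in-memory dictionary, and no compression. The names `build_inverted_index` and `order_by_tf` and the sample documents are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, tf, positions) records."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # lists end up in ascending doc_id order
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            # tf is simply the number of recorded positions
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)

def order_by_tf(posting_list):
    """Alternative ordering: descending TF, ties broken by doc_id."""
    return sorted(posting_list, key=lambda p: (-p[1], p[0]))

# Hypothetical toy collection used only for illustration.
docs = {
    1: "fast query evaluation",
    2: "query processing for fast query answers",
    3: "inverted index",
}
index = build_inverted_index(docs)
print(index["query"])               # doc_id order
print(order_by_tf(index["query"]))  # TF order
```

In this sketch the same posting list can be read in either of the two orderings mentioned in the text; a real engine would fix one ordering at indexing time and store the lists in compressed form.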
* Corresponding author.
E-mail addresses: (Z. I. Mansouria),
(Z. Lougmiri)
DOI: 10.9781/ijimai.2019.04.002
In information retrieval, two basic alternatives have been proposed for traversing the
posting lists: the TAAT (Term-At-A-Time) and DAAT (Document-At-A-Time) strategies [9].
In addition, SAAT (Score-At-A-Time), GAAT (Graph-At-A-Time), RAAT (Rank-At-A-Time) and
JASS (SAAT) are further strategies that have been proposed to remedy the weaknesses of the
first two [10][11][12].
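The difference between the two basic strategies can be sketched as follows. This is a minimal illustration, assuming that each posting is a `(doc_id, weight)` pair, that lists are sorted by ascending `doc_id`, and that a document's score is the plain sum of its term weights; the function names and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict

def taat(postings, query, k):
    """Term-At-A-Time: consume one posting list completely before the
    next, keeping a partial score (accumulator) per document."""
    acc = defaultdict(float)
    for term in query:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id] += weight
    return heapq.nsmallest(k, acc.items(), key=lambda x: (-x[1], x[0]))

def daat(postings, query, k):
    """Document-At-A-Time: advance all lists in parallel by doc_id, so
    each document is fully scored before the next one is touched."""
    lists = [postings[t] for t in query if t in postings]
    cursors = [0] * len(lists)
    results = []
    while True:
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break  # every list is exhausted
        doc_id = min(current)
        score = 0.0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                score += lst[cursors[i]][1]
                cursors[i] += 1
        results.append((doc_id, score))
    return heapq.nsmallest(k, results, key=lambda x: (-x[1], x[0]))

# Hypothetical posting lists: term -> [(doc_id, weight), ...]
postings = {
    "fast": [(1, 1.0), (2, 1.0)],
    "query": [(1, 1.0), (2, 2.0), (4, 1.0)],
}
print(taat(postings, ["fast", "query"], 2))
print(daat(postings, ["fast", "query"], 2))
```

Both functions return the same ranking here. TAAT needs an accumulator for every candidate document, whereas DAAT only needs one cursor per list, which is why threshold-based skipping is usually built on DAAT-style traversal.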
Since the first works in the field of information retrieval [13], the stopping condition
has been an important part of every early termination algorithm. It consists in ending
execution once k responses have been computed, even if other results remain at ranks greater
than k. A document is considered relevant if its score is greater than or equal to a certain
bound. The threshold algorithm (TA) of Ronald Fagin [14] is one of the most popular
algorithms in the database context. As information systems are so large and search engines
must deal with large datasets, the WAND algorithm has become unavoidable. It has been used
in a number of commercial search engines [15]. It is able to skip, in an intelligent manner,
some documents and parts of posting lists according to a precise test, as we will see in the
next sections.
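The role of the bound can be illustrated with a simplified threshold test. The sketch below is not the WAND algorithm itself: it only shows, under stated assumptions, how a per-term score upper bound lets a traversal avoid fully scoring documents that cannot enter the current Top-k. It assumes `(doc_id, weight)` postings sorted by `doc_id`, additive scoring, and a precomputed `upper_bounds` table; all names and data are hypothetical.

```python
import heapq

def topk_with_threshold(postings, upper_bounds, query, k):
    """DAAT traversal that skips full scoring when a document's
    upper-bound score cannot beat the current k-th best score."""
    lists = [(postings[t], upper_bounds[t]) for t in query if t in postings]
    cursors = [0] * len(lists)
    heap = []         # min-heap of (score, -doc_id); heap[0] is the weakest entry
    fully_scored = 0  # number of documents whose exact score was computed
    while True:
        current = [lst[c][0] for (lst, _), c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break
        doc_id = min(current)
        hits = [i for i, (lst, _) in enumerate(lists)
                if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id]
        threshold = heap[0][0] if len(heap) == k else 0.0
        bound = sum(lists[i][1] for i in hits)  # best score doc_id could reach
        if len(heap) < k or bound > threshold:
            score = sum(lists[i][0][cursors[i]][1] for i in hits)
            fully_scored += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, -doc_id))
            elif score > threshold:
                heapq.heapreplace(heap, (score, -doc_id))
        for i in hits:
            cursors[i] += 1
    ranked = sorted(heap, key=lambda e: (-e[0], -e[1]))
    return [(-neg_id, score) for score, neg_id in ranked], fully_scored

# Hypothetical data; each upper bound is the largest weight in its list.
postings = {
    "fast": [(1, 1.0), (2, 1.0), (5, 2.0)],
    "query": [(1, 1.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 1.0)],
}
upper_bounds = {"fast": 2.0, "query": 2.0}
top, scored = topk_with_threshold(postings, upper_bounds, ["fast", "query"], 2)
print(top, scored)
```

In this toy run, documents 3 and 4 are never fully scored, because their upper bound does not exceed the score of the weakest document already in the Top-2. Unlike WAND, this sketch still visits every posting; it only avoids the scoring step.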
The main weakness of early termination algorithms lies in the lack of precision of their
responses. For Top-k and a query q of length Lq, WAND commonly misses, in the Top-r with
r ≤ k, some documents that share a high number of terms with the query. We built our
solution on this observation. In this paper, we focus on early termination and propose a new
extension of the WAND algorithm. Our aims are: a) to return all totally relevant documents
that contain the query terms; b) to reduce the number of operations in query processing;
c) to improve result quality by improving response precision. In particular, we propose new
fine-grained metrics to measure the degree of relevance of the returned documents. Compared
to the naive approach, which returns documents that contain at least one of the query terms,
our approach returns the relevant documents ranked first, without any loss in precision, in
recall, or in the newly proposed metrics.
The remainder of the paper proceeds as follows. Section II presents the index structure,
the TAAT/DAAT strategies and early termination. The WAND approach is detailed in Section III.
In Section IV, we state our contributions, and Section V presents the details of our
proposals. In Section VI, we describe our experimental results. Finally, we conclude in
Section VII and discuss our future work.
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 5, Nº 7
II. Background
In this section, we provide background on index structure, compression, early termination
and index traversal strategies.
A. Index Structure
To evaluate queries in search engines, a well-defined structure is constructed; it gives
precise information about every term in the set of documents. This structure is called an
inverted index because, for each term, it captures the list of documents that contain the
term, along with other information. The term is thus used as the access key to the
structure. Every line of this structure is called a posting list when it is loaded into
memory. This structure has been widely presented and explained in the literature. Works
[7][16][17][18][1] cover everything related to its construction and use.
For a term ti that appears in the collection, a record of its inverted list can have the
following format:
dj | TFij | pos1, pos2, …
where dj is a unique document number, also called the document
identifier, in the set of documents co (...truncated)