MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

Article PDF:

https://www.ijimai.org/journal/sites/default/files/files/2019/04/ijimai20195_7_6_pdf_83440.pdf

Regular Issue

MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

Zemani Imene Mansouria*, Zekri Lougmiri*, Senouci Mohamed

Department of Computer Sciences, LAPECI Laboratory, Ahmed Ben Bella Oran1 University (Algeria)

Received 22 October 2018 | Accepted 22 March 2019 | Published 10 April 2019

Abstract

Current information systems are very large and maintain huge amounts of data. At any time, they process millions of documents and millions of queries. To select the most important responses from this amount of data, it is advisable to apply so-called early termination algorithms. These attempt to extract the Top-k documents according to a specified monotone increasing function. The principal idea is to reach and score only the smallest possible number of the most significant documents, and so avoid fully processing the whole collection. The WAND algorithm is the state of the art in this area. Although it is efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal proposal is a new early termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than WAND, and it is able to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and we add new levels to query processing. In the second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. The experimental results on real datasets show that MWAND is more efficient than the WAND approach.

Keywords

Evaluation Measures, Information Retrieval, Large Inverted List, MWAND, Query Processing, Top-k, WAND.

I. Introduction

Practical web search engines are very complex, with the goal of returning fast and precise results. The result must be both effective and efficient.
These search engines use query processing techniques and algorithms, such as the WAND algorithm, to return a set of ranked result documents named Top-k. These algorithms are executed on a data structure called an inverted index [1]. For every term, this structure gives the set of documents in which the term appears, together with additional information such as the term frequencies (TF), the list of positions in every document, and the format and the size in which the term is written. Such a construction generates a very large index. In fact, its size is larger than the set of original documents. As a consequence, traversing this index becomes the major bottleneck in query processing. It is impractical, or even impossible, to sweep all posting lists. An early termination algorithm is therefore recommended in such a situation. It can return the exact Top-k without scanning the entire posting lists. Note that a posting list is the part of the inverted index loaded into memory for processing. Note also that the lists are sorted either in ascending order of document numbers [2][3] or in descending order of TF [4][5]. The choice depends on what the algorithm designer wants. To reduce the size of the information represented in the posting lists, a set of compression techniques has been proposed [6][7][8].

In information retrieval, two major, basic alternatives have been proposed for traversing the posting lists: the TAAT (Term-At-A-Time) and DAAT (Document-At-A-Time) strategies [9]. SAAT (Score-At-A-Time), GAAT (Graph-At-A-Time), RAAT (Rank-At-A-Time) and JASS (SAAT) are additional strategies that have been proposed to remedy the weaknesses of the first two [10][11][12].

* Corresponding author. E-mail addresses: (Z. I. Mansouria), (Z. Lougmiri)
DOI: 10.9781/ijimai.2019.04.002

Since the first works in the field of information retrieval [13], the stopping condition has been an important part of every early termination algorithm.
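The inverted index and the document-at-a-time traversal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `build_index` and `daat_top_k`, the whitespace tokenizer, and the score (a plain sum of term frequencies, chosen only because it is monotone increasing) are assumptions made for illustration. The traversal is exhaustive DAAT and contains no WAND-style skipping.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """Map each term to a posting list [(doc_id, tf, positions), ...].

    Posting lists are sorted by ascending doc_id, one of the two
    orderings mentioned in the text.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)


def daat_top_k(index, query_terms, k):
    """Exhaustive DAAT: score every candidate document once, keep the best k."""
    lists = [index[t] for t in query_terms if t in index]
    cursors = [0] * len(lists)
    heap = []  # min-heap of (score, -doc_id): the weakest kept result is on top
    while True:
        # Documents currently under the cursors of the unfinished lists.
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break
        doc_id = min(current)  # next document in ascending doc_id order
        score = 0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]  # assumed score: sum of term frequencies
                cursors[i] += 1
        entry = (score, -doc_id)  # ties are resolved in favour of smaller doc_id
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    ranked = sorted(heap, reverse=True)
    return [(-neg_id, score) for score, neg_id in ranked]


docs = {
    1: "wand early termination",
    2: "top k query query processing",
    3: "wand query wand processing",
    4: "inverted index",
}
index = build_index(docs)
print(index["wand"])
print(daat_top_k(index, ["wand", "query"], k=2))
```

This exhaustive version scores every document that contains at least one query term. What it lacks is the stopping condition that defines an early termination algorithm.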
The stopping condition consists in ending the execution once k responses have been computed, even if there are more important results with ranks greater than k. A document is considered relevant if its score is greater than or equal to a certain bound. The threshold algorithm (TA) of Ronald Fagin [14] is one of the most popular algorithms in the context of databases. As information systems are so large and search engines must deal with large datasets, the WAND algorithm has become unavoidable. It has been used in a number of commercial search engines [15]. It is able to skip, in an intelligent manner, some documents and parts of posting lists according to a precise test, as we will see in the next sections.

The main weakness of early termination algorithms is the lack of precision in their responses. For Top-k and for a query q of length Lq, WAND commonly misses in the Top-r, with r ≤ k, some documents that share a high number of terms with the query. We built our solution on this observation.

In this paper, we focus on early termination and we propose a new extension to the WAND algorithm. Our aims are:

a) To return all totally relevant documents that contain query terms.
b) To reduce the number of operations in query processing.
c) To improve the quality of the results by improving the precision of the responses.

In particular, we propose new fine-grained metrics to measure the degree of relevance of the returned documents. Compared to the naive approach, which returns the documents that contain at least one of the query terms, our approach returns the relevant documents ranked first, without any loss in precision, in recall, or in the newly proposed metrics.

International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 5, Nº 7

The remainder of the paper proceeds as follows: Section II gives a representation of the index structure, the TAAT / DAAT strategies and early termination. The WAND approach is detailed in Section III.
In section IV, we quote our contributions and section V presents the details of our propositions. In section VI, we describe our experimental results. Finally, we conclude in section VII and discuss our future work. II. Background In this section, we provide a background on index structure, compression, early termination and index traversing strategies. A. Index Structure In order to evaluate queries in search engines, a well structure is constructed; it gives with precision all information about every term in the set of documents. This structure is called inverted index as it captures the list of documents which contain this term with other information. So, the term is used as a key access to such structure. Every line from this structure is called a posting list when it is charged in memory. This structure was largely presented and explained in literature. Works [7][16][17][18][1] have presented all what can be related to the construction and the use of such structure. For a term ti which appears, a record of its inverted list can have the next format: dj TFij pos1, pos2, … Where dj is a unique document number, called also document identifier, in the set of documents co (...truncated)