A Hybrid Deep Learning Architecture for Latent Topic-based Image Retrieval

Data Science and Engineering, Apr 2018

Learning effective feature descriptors that bridge the semantic gap between low-level visual features directly extracted from image pixels and the corresponding high-level semantics perceived by humans is a challenging task in image retrieval. This paper proposes a hybrid deep learning architecture (HDLA) that generates sparse latent topic-based representation with the objective of minimizing the semantic gap problem in image retrieval. In fact, HDLA has a deep network structure with a constrained replicated Softmax Model in the lower layer and constrained restricted Boltzmann machines in the upper layers. The advantage of HDLA is that there exist nonnegativity restrictions on the model weights together with \(\ell _1\)-sparsity enforced over the activations of the hidden layer nodes of the network. This, in turn, enhances the modeling power of the network and leads to sparse, parts-based latent topic representation of images. Experimental results on various benchmark datasets show that the proposed model exhibits better generalization ability and the resulting high-level abstraction yields better retrieval performance as compared to state-of-the-art latent topic-based image representation schemes.

K. S. Arun (corresponding author), V. K. Govindan
Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode 673601, India

Keywords: Image retrieval; Deep learning

1 Introduction

The rapid expansion of digital image repositories poses numerous challenges to computer vision research. Among them, the most important one is the development of accurate and efficient mechanisms to search and retrieve desired images from various digital image repositories.
Making use of feature vectors automatically extracted from image pixels together with a suitable similarity measure, Content-Based Image Retrieval (CBIR) systems enable the search and retrieval of images from large repositories that are similar to a given query image. In the CBIR domain, the state-of-the-art approaches are based on the Bag of Visual Words (BoVW) model, where images are represented as histograms of visual words. Even though the effectiveness of the BoVW model in image retrieval has been demonstrated by many researchers, it still suffers from a major drawback: the resulting image representation is not as discriminative and descriptive as desired. This is mainly due to the loss of semantic information of visual words at each processing step of the BoVW model. Therefore, the semantic loss associated with BoVW-based image representation has to be minimized for better retrieval performance.

As the clustering operation in the BoVW model often fails to take semantic information into account, there is a high probability that the generated visual dictionary contains many ambiguous visual words. These ambiguous visual words hinder the discriminative power of the BoVW-based image representation. The semantic loss in the BoVW model can be reduced to a great extent by automatically grouping semantically similar visual words and then encoding images using these newly identified semantic structures. The work presented in this paper follows this principle to derive a low-dimensional but highly discriminative feature vector from the original BoVW-based representation for the task of image retrieval.

It has been observed that visual polysemy and visual synonymy are the root causes behind the induction of ambiguous visual words in the traditional BoVW model. In general, polysemy and synonymy can be regarded as the representational uncertainty of visual information.
Polysemy is the property of a visual word that corresponds to two or more semantic concepts, while synonymy is the property of two or more visual words that correspond to the same semantic concept. Polysemy originates as a consequence of the visual appearance diversity of different semantic concepts, and it often leads to low inter-semantic discrimination. On the other hand, synonymy arises from the appearance-based diversity within a particular semantic concept. Thus, if two semantically dissimilar images share a set of polysemous visual words, they lie close to each other in the visual word-based feature space. Similarly, synonymous visual words may cause images with the same semantics to be far apart in the visual word-based feature space. Therefore, to minimize semantic loss and thus improve the overall performance of BoVW-based image retrieval, the issues of polysemy and synonymy need to be tackled effectively.

To mitigate polysemy and synonymy, researchers have proposed projecting the image representation from the visual word space to an intermediate latent topic space. The underlying idea of latent topics is that not all visual words contain the same amount of information to describe the appearance of images. Therefore, to have better retrieval effectiveness, it is important to use very specific visual words with high discriminative power. This can be achieved by generalizing visual words which share similar meanings to a less specific latent topic. In this way, a set of latent topics \(h = \{h_1, h_2, \ldots, h_N\}\) is defined such that a visual word can belong to none, one or several latent topics. Figure 1 depicts the above-mentioned notion of latent topics in detail.

Fig. 1 Pictorial representation of the notion of visual words, latent topics and their interrelationship
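To make the effect of synonymy concrete, the following toy sketch compares two images in the visual word space and in a latent topic space. The four-word dictionary, the word-to-topic weights in `WORD_TOPIC` and the helper `to_topics` are hypothetical illustrations and not part of the proposed model; in practice the mapping would be learned rather than fixed by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def to_topics(hist, word_topic):
    """Map a visual-word histogram to normalized latent topic proportions.

    word_topic[w][t] is a hypothetical weight of visual word w in topic t.
    """
    n_topics = len(word_topic[0])
    scores = [sum(hist[w] * word_topic[w][t] for w in range(len(hist)))
              for t in range(n_topics)]
    total = sum(scores)
    return [s / total for s in scores] if total else scores

# Hypothetical dictionary: 4 visual words, 2 latent topics.
# Words 0 and 1 are synonyms (topic 0); words 2 and 3 belong to topic 1.
WORD_TOPIC = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

image_a = [5, 0, 1, 0]  # concept expressed mostly through word 0
image_b = [0, 5, 0, 1]  # same concept, expressed through synonymous word 1

sim_bovw = cosine(image_a, image_b)
sim_topic = cosine(to_topics(image_a, WORD_TOPIC),
                   to_topics(image_b, WORD_TOPIC))
print(f"BoVW similarity:  {sim_bovw:.3f}")
print(f"Topic similarity: {sim_topic:.3f}")
```

The two histograms share no visual word, so they are orthogonal in the BoVW space, yet they map to the same topic proportions once synonymous words are grouped.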
In the end, images are characterized by the proportion of latent topics, and this representation is found to be more reliable than the BoVW-based feature when calculating the similarity between images. As latent topics are learned in a completely unsupervised manner, it is not possible to precisely associate a particular semantic concept with each latent topic. However, images with identical latent topic representations are assumed to contain the same semantic concepts and are treated as semantically similar while measuring image similarity. Hence, the notion of latent topics considerably minimizes the semantic loss associated with the BoVW model and thus increases the discriminative power of the resulting image representation.

Numerous latent topic-based image retrieval frameworks are available in the literature, and the majority of these approaches are based on graphical models. Such approaches try to maximize the joint distribution of visual words and latent topics to effectively capture the latent topic structures present in the visual word collection. In general, this joint distribution is modeled using a graphical structure. Graphical model-based latent topic frameworks for image retrieval fall into two fundamental categories: (i) directed topic models and (ii) undirected topic models. The former category involves models based on directed graphs, and the most successful approaches in this direction are Probabilistic Latent Semantic Analysis (PLSA) [1], Latent Dirichlet Allocation (LDA) [2], Correlated Topic Models (CTM) [3] and Pachinko Allocation Model (PAM) [4]. In contrast, undirected topic modeling frameworks encode the joint distribution by means of undirected graphs. Recently, several undirected topic models have been proposed for image retrieval. The most popular among them are the Rate Adapting Poisson model (RAP) [5] and the Replicated Softmax Model (RSM) [6].
The major drawback of directed topic modeling schemes is that exact inference is intractable, so they have to rely on approximation algorithms to compute the posterior distribution of latent topics. Another notable limitation is the disjunctive coding principle of directed topic models: they assume that a visual word comes from a single latent topic, which results in a suboptimal representation of images. A more accurate latent topic-based image characterization can be obtained with undirected topic models. In general, undirected models follow the conjunctive coding principle and assume that a visual word always comes from a distribution influenced by all the latent topics. Moreover, accurate and efficient inference techniques have also been developed for undirected models. For these reasons, undirected topic models have achieved state-of-the-art performance on large-scale image retrieval as compared to their directed counterparts.

This paper investigates the applicability of an undirected deep network for extracting latent topic-based feature descriptors from images to tackle the semantic loss associated with the BoVW representation. To this end, an undirected topic modeling scheme named Hybrid Deep Learning Architecture (HDLA) is proposed, and the latent topic-based image representation obtained with the proposed model yields semantically similar images in response to a given query. In particular, this paper makes the following contributions:

(i) A hybrid deep learning architecture which is able to model the higher-order correlations among visual words by employing multiple levels of nonlinear transformations.
(ii) A compact but discriminative image representation, well suited for the retrieval task, obtained by directly imposing nonnegativity restrictions on the network weights and an \(\ell_1\)-sparseness constraint on the hidden layer activations.

The rest of this paper is organized as follows: Sect. 2 summarizes the related work in latent topic-based image retrieval.
Section 3 presents the background on the Restricted Boltzmann Machine (RBM), the Replicated Softmax Model (RSM) and the Deep Boltzmann Machine (DBM). Section 4 explains the proposed latent topic-based image retrieval framework in detail, including the formulation of the proposed HDLA model and the procedure used to obtain its parameters. Section 5 describes how the latent topic-based representation is derived for a previously unseen image and gives the details of the distance metric used for similarity estimation. Section 6 presents the empirical evaluation of the proposed HDLA model. Finally, the paper is concluded in Sect. 7 by highlighting the advantages of the proposed HDLA model.

2 Related Work

Topic models, which automatically analyze and discover latent semantic structures from large image collections, have been widely explored in the image retrieval domain over the past few years. The basic idea behind topic modeling is the mapping of the high-dimensional BoVW representation of images to a much lower-dimensional space defined by the latent topics. Loosely speaking, a latent topic can be viewed as a set of semantically related visual words. Thus, an image containing a large number of visual words can be concisely modeled using a smaller number of latent topics. This permits easy estimation of semantic image similarity and consequently helps to improve the overall retrieval effectiveness. A brief review of the most influential topic modeling schemes in image retrieval research is presented in the rest of this section.

Latent Semantic Analysis (LSA) [7] is regarded as the most primitive topic modeling scheme for semantics-based image retrieval. Pecenovic [8] introduced an LSA-based image modeling framework in which a visual word co-occurrence matrix is initially generated by accumulating the BoVW representation of all the images in the given collection.
It is then decomposed into a set of orthogonal factors using Singular Value Decomposition (SVD), and the singular vectors corresponding to the \(k\) largest singular values constitute the latent topics that represent the relevant semantic structures. When a query image is presented to the system, it is projected into the latent topic space, and the cosine similarity between the query and each indexed image is computed to get a ranked retrieval list. Although LSA is a competent approach, it is computationally intensive: singular value decomposition of the visual word co-occurrence matrix is not practically feasible for large-scale image databases.

Directed topic models have been developed to overcome this limitation of LSA. These models are based on the assumption that each image is a mixture of latent topics and each latent topic, in turn, is a distribution over the visual words. Directed topic models are generally represented with graphical structures comprising a set of random variables. The graphical representation mostly involves two different types of random variables: visible and hidden ones. The visible variables represent the visual word counts extracted from the given image collection, and the hidden variables capture the semantic structures (latent topics) embedded in these visual words. The directed topic models then find an optimal set of latent topics that best explains the visual words found in the given images. Comprehensive evaluation of various directed topic modeling schemes on large-scale image data sets has shown promising results in terms of retrieval precision and recall.

The last decade has witnessed the emergence of a number of directed topic modeling schemes. The earliest effort in this direction is Probabilistic Latent Semantic Analysis (PLSA) [1]. Using PLSA, Zhang et al. [9] encoded an image by a probability distribution over latent topics, with only a few of them assigned high probability values.
The PLSA model treats each image as a mixture of a finite number of latent topics. Model fitting then involves the estimation of topic-specific visual word distributions and image-specific latent topic distributions from the given database using Maximum Likelihood Estimation (MLE). Experimental results have demonstrated that PLSA-based image modeling schemes perform remarkably well in large-scale image mining operations.

In order to capture more accurate semantic structures, several research attempts have been made to enhance various aspects of the original PLSA model. With this objective, Lienhart et al. [10] proposed a multilayer PLSA architecture that incorporates not just a single layer of hidden variables, but multiple layers with a hierarchy of variables. Hence, information from various modalities can be efficiently integrated to form more meaningful abstractions. On the other hand, Li et al. [11] introduced correlated PLSA (c-PLSA), which tries to merge inter-image correlations into the basic PLSA formulation, and reported promising results in image retrieval tasks. Later on, Chiang et al. [12] proposed the Probabilistic Semantic Component Descriptor (PSCD), whereby the latent topics associated with local image regions are initially identified and this regional semantics is then integrated to form a final image descriptor.

However, in PLSA-based image modeling, it is not clear how to infer the topic proportions for an unseen image. That is, the entire model needs to be re-estimated when an image from outside the training dataset is presented as the query. Therefore, the PLSA model and its variants are not scalable. Moreover, the number of parameters to be estimated grows linearly with the size of the image dataset, and hence the learned model tends to overfit the training samples as the number of images in the collection increases. Later on, Blei et al.
[2] formulated a more sophisticated directed topic modeling scheme called Latent Dirichlet Allocation (LDA). Similar to PLSA, LDA assumes that each image is represented by a mixture of a fixed number of latent topics and that each topic is a mixture over the set of all visual words in the dictionary. In contrast to PLSA, LDA further assumes that these mixture distributions are Dirichlet-distributed random variables whose parameters have to be estimated from the training data. Therefore, once the parameters of the Dirichlet distributions are learned, the topic proportions for an unseen image can be predicted easily, which is not the case with PLSA-based models. Moreover, the Dirichlet prior on the per-document topic distribution significantly reduces the effect of overfitting. Horster et al. [13] investigated the applicability of LDA in the context of semantic image modeling and demonstrated its effectiveness in query-by-example image retrieval settings.

Due to its good scalability, the LDA model has been further extended by many researchers. One such simple extension is the Correlated Topic Model (CTM) [14]. It is similar to LDA except that, instead of drawing topic mixture proportions from a Dirichlet distribution, it draws them from a logistic normal distribution. Thus, the parameters of CTM involve a covariance matrix whose entries represent the correlation between all pairs of latent topics. Greif et al. [14] adopted CTM to explicitly model topic correlation and derive a lower-dimensional latent topic vector, which was found to be superior to LDA. As CTM models the pairwise correlation of latent topics, the number of parameters in the covariance matrix grows as the square of the number of latent topics. Recently, the Pachinko Allocation Model (PAM) [15] emerged as a flexible alternative to CTM. In PAM, the nested correlation among latent topics is efficiently modeled.
It does so by extending the concept of latent topics to be distributions not only over the visual words but also over other latent topics. Using image data from large-scale databases, Boulemden and Tilli [4] reported improved performance of PAM-based latent topic representation in image retrieval.

It should be noted that inferring the posterior distribution of latent topics in directed topic modeling schemes such as LDA and its extensions is typically intractable. In general, approximate inference techniques such as variational methods [16], expectation propagation [17] and Gibbs sampling [18] are utilized to solve this problem. However, these inference algorithms are computationally expensive and time-consuming, especially for larger datasets.

An alternative for topic modeling is the construction of undirected graphical models. As stated before, the visible nodes of the undirected graph accept the BoVW representation of input images, and the hidden nodes indicate the latent topics learned from the given images. The nodes in undirected topic models are arranged in layers, with the visible nodes constituting the first layer and the hidden nodes forming the second layer. This layered architecture has the important characteristic that the nodes in one layer are conditionally independent given the values of the nodes in the other layer. With this type of architecture, the mapping from the input space (i.e., visual words) to latent topics can be done by a simple matrix multiplication. As a result, the overall retrieval performance can be significantly improved in settings where speed is a primary concern. Additionally, undirected models generate distributed latent topic representations, which have been shown to be superior to the representations obtained with directed topic models for the task of image retrieval. To date, only a handful of cases have been reported in the image retrieval literature using undirected topic models.
The Rate Adapting Poisson model (RAP) [5] is one of the earlier works in this direction. In this model, it is assumed that the distribution of the hidden nodes is Binomial and that of the visible nodes is Poisson. Even though the RAP-based image retrieval framework performs well in terms of retrieval accuracy, the parameter learning process is unstable and difficult. Recently, there has been great interest in using the Replicated Softmax Model (RSM) [6] for large-scale image retrieval. It is basically a generalization of the Restricted Boltzmann Machine (RBM) [19]. The advantage of using RSM over RAP for deriving high-level image abstractions is that parameter estimation is faster and more stable. The Replicated Softmax Model is trained using a fairly efficient learning procedure known as the contrastive divergence algorithm [20]. More importantly, the generalization ability of RSM for unseen images is far better than that of other models, and this, in turn, considerably enhances the overall retrieval performance.

More recently, a high-level abstraction of text documents learned using a Deep Boltzmann Machine (DBM)-based formulation called the Over Replicated Softmax Model (ORSM) [21] demonstrated promising results for text document classification and retrieval. It has been observed that the high-level abstraction obtained with ORSM has better generalization performance on unseen data as compared to other topic modeling schemes. Encouraged by the recent success of ORSM in modeling text documents, this paper investigates the applicability of an undirected deep learning architecture for extracting efficient latent topic-based representations of images.

To summarize, the effectiveness of topic modeling schemes depends entirely on the quality of the latent topics discovered. It turns out that the majority of the above-mentioned models still generate latent topics of inferior quality.
This leads to a poor semantic characterization of images and hence degrades the overall retrieval performance. It has been observed that deep network models with many layers of latent topic variables can alleviate this shortcoming. However, selecting an optimum value for the number of latent topics in each hidden layer is not a straightforward task in such deep models. That is, it should be large enough to fit the characteristics of the image data at hand and at the same time small enough to filter out the irrelevant representational details. In this scenario, a sparse feature representation [22], where only a few latent topics describe the relevant information, does the trick. Therefore, this paper investigates a hybrid deep learning architecture that generates a sparse, parts-based characterization of images using latent topics and is found to be suitable for large-scale image retrieval.

3 Preliminaries

Before the proposed model is introduced, it is important to understand the deep learning models that are the stepping stones toward the newly proposed hybrid deep learning architecture. This section provides an overview of the Restricted Boltzmann Machine (RBM) [19] and its variants, namely the Replicated Softmax Model (RSM) [6] and the Deep Boltzmann Machine (DBM) [23]. To begin with, RBM is examined by elaborating the contrastive divergence algorithm [20] for deriving the model parameters. Then, the theory behind RSM is outlined, which is useful for modeling visual word-count vectors extracted from images. Finally, the working principle of DBM is explained along with the layer-by-layer training procedure to learn its model parameters.

Let us first introduce the main notations used in this paper. Some of them are used in this section, and the rest are used in the subsequent section where the formulation of the proposed HDLA model is described.
All these notations are summarized in Table 1.

3.1 Restricted Boltzmann Machine

A Restricted Boltzmann Machine (RBM) is an undirected probabilistic graphical model with a bipartite structure. As depicted in Fig. 2, there exist two layers of binary stochastic units in an RBM, namely the visible layer \(u = [u_1, u_2, \ldots, u_K]\) and the hidden layer \(h = [h_1, h_2, \ldots, h_T]\). The visible layer nodes correspond to observed data, and the nodes in the hidden layer capture the dependencies among the observed data. Each node in the visible layer is connected to all the nodes in the hidden layer and vice versa, and there is no link between nodes within the same layer. In its standard form, the visible and hidden layer units of an RBM are binary-valued. That is, the space of visible vectors for a binary RBM is \(u \in \{0, 1\}^K\), while the space of hidden unit vectors is \(h \in \{0, 1\}^T\). Associated with the nodes in the visible and hidden layers are bias units, and the corresponding bias offsets are represented by \(b = [b_1, b_2, \ldots, b_K]\) and \(a = [a_1, a_2, \ldots, a_T]\). The interaction between a visible layer node \(i\) and a hidden layer node \(j\) is quantified by a real-valued weight \(w_{ij}\). The pairwise weights between all the elements of \(u\) and \(h\) are summarized by a \(K \times T\) weight matrix \(W\); the connections are symmetric, i.e., the same weight is used in both directions.

It is important to note that RBMs are special cases of Energy-Based Models (EBM), in which the relationships among variables are modeled by assigning energy values to each of their joint configurations. The model parameters of an RBM are then learned by minimizing the energy of all the desirable configurations of the state space vectors. The following function computes the energy value for the joint configuration of visible and hidden layer nodes \((u, h)\):

\[
E(u, h; \Theta) = -u^{T} W h - b^{T} u - a^{T} h \tag{1}
\]

where \(\Theta = [W, b, a]\) is the model parameter vector. Based on this energy function, the model can further assign probabilities to every possible pair of visible and hidden state vectors.
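As an illustration of how an energy function induces probabilities over joint states, the sketch below enumerates every configuration of a very small RBM. It assumes the standard RBM energy \(E(u, h) = -u^{T} W h - b^{T} u - a^{T} h\) and a Boltzmann distribution proportional to \(\exp(-E)\); the parameter values and the helper names `energy` and `joint_table` are hypothetical, and exhaustive enumeration is only feasible for toy sizes.

```python
import itertools
import math

def energy(u, h, W, b, a):
    """E(u, h) = -u^T W h - b^T u - a^T h for binary u (len K) and h (len T)."""
    interaction = sum(u[i] * W[i][j] * h[j]
                      for i in range(len(u)) for j in range(len(h)))
    visible_bias = sum(b[i] * u[i] for i in range(len(u)))
    hidden_bias = sum(a[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias

def joint_table(W, b, a):
    """Enumerate all (u, h) states and return {(u, h): probability}."""
    K, T = len(b), len(a)
    states = [(u, h)
              for u in itertools.product((0, 1), repeat=K)
              for h in itertools.product((0, 1), repeat=T)]
    unnorm = {s: math.exp(-energy(s[0], s[1], W, b, a)) for s in states}
    Z = sum(unnorm.values())  # brute-force normalizer over 2^(K+T) states
    return {s: v / Z for s, v in unnorm.items()}

# Hypothetical toy parameters: K = 3 visible units, T = 2 hidden units.
W = [[1.5, -0.5], [0.2, 0.8], [-1.0, 0.3]]
b = [0.1, -0.2, 0.0]
a = [-0.3, 0.4]

table = joint_table(W, b, a)
best = max(table, key=table.get)
print("total probability:", round(sum(table.values()), 6))
print("most probable state:", best,
      "energy:", round(energy(*best, W, b, a), 3))
```

The most probable joint state is the one with the lowest energy, which is the sense in which training lowers the energy of desirable configurations.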
The joint distribution over the visible and hidden units is defined by:

\[
p(u, h; \Theta) = \frac{1}{Z(\Theta)} \exp(-E(u, h; \Theta)) \tag{2}
\]

where \(Z(\Theta)\) is a normalization constant, also known as the partition function. The value of \(Z(\Theta)\) is computed as follows:

\[
Z(\Theta) = \sum_{u} \sum_{h} \exp(-E(u, h; \Theta)) \tag{3}
\]

Similarly, the model assigns a probability to the visible vector \(u\) in the following fashion:

\[
p(u; \Theta) = \frac{1}{Z(\Theta)} \sum_{h} \exp(-E(u, h; \Theta)) \tag{4}
\]

Because of the bipartite structure of the RBM, the conditional distributions over the hidden units \(h\) and the visible vector \(u\) can be easily derived from Eq. (2) and are given by:

\[
p(h \mid u) = \prod_{j=1}^{T} p(h_j \mid u) \tag{5}
\]

\[
p(u \mid h) = \prod_{i=1}^{K} p(u_i \mid h) \tag{6}
\]

where the individual activation probabilities \(p(h_j \mid u)\) and \(p(u_i \mid h)\) are defined as follows:

\[
p(h_j = 1 \mid u) = \sigma\Big(a_j + \sum_{i=1}^{K} w_{ij} u_i\Big) \tag{7}
\]

\[
p(u_i = 1 \mid h) = \sigma\Big(b_i + \sum_{j=1}^{T} w_{ij} h_j\Big) \tag{8}
\]

where \(\sigma(\cdot)\) is the logistic sigmoid function defined as \(\sigma(y) = 1/(1 + \exp(-y))\). Thus, the RBM is a powerful generative model capable of capturing the covariance structure present in the given input observations in a completely unsupervised fashion. This helps to group semantically similar visual words into a relatively small number of latent topics, and thus a more efficient latent topic-based image characterization can be derived with RBM-based image modeling. The next section provides a detailed description of the training procedure used to learn the model parameters of the RBM.

3.1.1 Contrastive Divergence Algorithm

The Restricted Boltzmann Machine is trained in such a way that the obtained model parameter \(\Theta\) minimizes the negative log-likelihood of the given training data set. Let \(\mathcal{D} = \{u_1, u_2, \ldots, u_N\}\) be the set of independent and identically distributed training samples; then the log-likelihood of \(\mathcal{D}\) is given by:

\[
\ell(\mathcal{D}; \Theta) = \ln \prod_{i=1}^{N} p(u_i; \Theta) = \sum_{i=1}^{N} \ln p(u_i; \Theta) \tag{9}
\]

The unknown parameter vector \(\Theta\) of the RBM is then learned by solving the following optimization problem:

\[
\Theta^{*} = \arg\min_{\Theta} \big( -\ell(\mathcal{D}; \Theta) \big) \tag{10}
\]

The stochastic gradient descent procedure is then used to optimize the model parameter values. It updates the parameter vector \(\Theta\) as:

\[
\Theta^{(m+1)} = \Theta^{(m)} + \Delta\Theta \tag{11}
\]

where \(m\) is the epoch index, and an epoch denotes one presentation of the full training set to the learning algorithm.
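\(\Delta\Theta\) is the change in the parameter vector \(\Theta\). In each epoch, \(\Delta\Theta\) is initialized to zero and subsequently changed in a direction that minimizes the negative log-likelihood, as shown below:

\[
\Delta\Theta = \eta \, \frac{\partial \ell(\Theta)}{\partial \Theta} \tag{12}
\]

where \(\eta\) is the learning rate, which controls the relative size of the changes in the parameter vector \(\Theta\). For the model defined in Eq. (1), the gradient of the log-likelihood given a single training example \(u_s\) is given by:

\[
\begin{aligned}
\frac{\partial}{\partial \Theta} \ell(u_s; \Theta)
&= \frac{\partial}{\partial \Theta} \ln p(u_s; \Theta)
 = \frac{\partial}{\partial \Theta} \ln \Big[ \frac{1}{Z(\Theta)} \sum_{h} \exp(-E(u_s, h)) \Big] \\
&= \frac{\partial}{\partial \Theta} \Big[ \ln \sum_{h} \exp(-E(u_s, h)) \Big]
 - \frac{\partial}{\partial \Theta} \Big[ \ln \sum_{u} \sum_{h} \exp(-E(u, h)) \Big] \\
&= -\frac{1}{\sum_{h} \exp(-E(u_s, h))} \sum_{h} \exp(-E(u_s, h)) \, \frac{\partial E(u_s, h)}{\partial \Theta} \\
&\quad + \frac{1}{\sum_{u} \sum_{h} \exp(-E(u, h))} \sum_{u} \sum_{h} \exp(-E(u, h)) \, \frac{\partial E(u, h)}{\partial \Theta}
\end{aligned}
\tag{13}
\]

Therefore, the gradient of the log-likelihood is the difference between two expectations. The first term of Eq. (13) is the expectation of the gradient of the energy function with respect to \(p(h \mid u_s)\) and is termed the data-dependent expectation. Similarly, the second term is the expectation of the gradient of the energy function with respect to \(p(u, h)\) and is known as the model-dependent expectation. As both terms involve expectations, the gradient of the log-likelihood can be rewritten as:

\[
\frac{\partial}{\partial \Theta} \ell(u_s; \Theta)
= -\mathbb{E}_{p(h \mid u_s)}\Big[ \frac{\partial E(u_s, h)}{\partial \Theta} \Big]
+ \mathbb{E}_{p(u, h)}\Big[ \frac{\partial E(u, h)}{\partial \Theta} \Big] \tag{14}
\]

where the shorthand notation \(\mathbb{E}_{p(h \mid u_s)}[\cdot]\) denotes the data-dependent expectation and \(\mathbb{E}_{p(u, h)}[\cdot]\) represents the model-dependent expectation.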
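The statement that the log-likelihood gradient is a difference between a data-dependent and a model-dependent expectation can be checked numerically on a model small enough to enumerate. The sketch below assumes the standard RBM energy \(E(u, h) = -u^{T} W h - b^{T} u - a^{T} h\), uses hypothetical toy parameters, and compares the analytic gradient with respect to one weight against a central finite difference of \(\ln p(u_s)\). It is an illustrative check, not the training procedure used in the paper.

```python
import itertools
import math

def energy(u, h, W, b, a):
    """Standard RBM energy: -u^T W h - b^T u - a^T h."""
    return -(sum(u[i] * W[i][j] * h[j]
                 for i in range(len(u)) for j in range(len(h)))
             + sum(bi * ui for bi, ui in zip(b, u))
             + sum(aj * hj for aj, hj in zip(a, h)))

def all_states(n):
    """All binary vectors of length n."""
    return list(itertools.product((0, 1), repeat=n))

def log_likelihood(u_s, W, b, a):
    """ln p(u_s) by brute-force enumeration (tiny models only)."""
    K, T = len(b), len(a)
    num = sum(math.exp(-energy(u_s, h, W, b, a)) for h in all_states(T))
    Z = sum(math.exp(-energy(u, h, W, b, a))
            for u in all_states(K) for h in all_states(T))
    return math.log(num / Z)

def grad_W(u_s, W, b, a):
    """d ln p(u_s)/dW = E_{p(h|u_s)}[u_s h^T] - E_{p(u,h)}[u h^T]."""
    K, T = len(b), len(a)
    hs, us = all_states(T), all_states(K)
    # Data-dependent expectation: unnormalized weights of p(h | u_s).
    w_data = [math.exp(-energy(u_s, h, W, b, a)) for h in hs]
    z_data = sum(w_data)
    # Model-dependent expectation: unnormalized weights of p(u, h).
    w_model = {(u, h): math.exp(-energy(u, h, W, b, a))
               for u in us for h in hs}
    z_model = sum(w_model.values())
    grad = [[0.0] * T for _ in range(K)]
    for i in range(K):
        for j in range(T):
            data = sum(w * u_s[i] * h[j]
                       for w, h in zip(w_data, hs)) / z_data
            model = sum(w * u[i] * h[j]
                        for (u, h), w in w_model.items()) / z_model
            grad[i][j] = data - model
    return grad

# Hypothetical toy parameters: K = 3 visible units, T = 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]]
b = [0.0, 0.1, -0.1]
a = [0.2, -0.2]
u_s = (1, 0, 1)

analytic = grad_W(u_s, W, b, a)

# Central finite difference on W[0][1] as an independent check.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus = [row[:] for row in W]
W_minus[0][1] -= eps
numeric = (log_likelihood(u_s, W_plus, b, a)
           - log_likelihood(u_s, W_minus, b, a)) / (2 * eps)
print(f"analytic: {analytic[0][1]:.6f}  numeric: {numeric:.6f}")
```

The model-dependent term already requires a sum over all \(2^{K+T}\) joint states here, which is why exact gradients do not scale beyond toy sizes.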
The derivative of the negative energy function with respect to the model parameters \(\Theta = [W, b, a]\) can easily be computed as follows:

\[
\begin{aligned}
\frac{\partial}{\partial W} \big( -E(u, h) \big) &= \frac{\partial}{\partial W} \, u^{T} W h = u h^{T} \\
\frac{\partial}{\partial a} \big( -E(u, h) \big) &= \frac{\partial}{\partial a} \, a^{T} h = h \\
\frac{\partial}{\partial b} \big( -E(u, h) \big) &= \frac{\partial}{\partial b} \, b^{T} u = u
\end{aligned}
\tag{15}
\]

Now the derivative of the log-likelihood of a given training pattern \(u_s\) with respect to the weights \(W\), the hidden layer bias \(a\) and the visible layer bias \(b\) becomes:

\[
\begin{aligned}
\frac{\partial}{\partial W} \ell(u_s; W) &= \mathbb{E}_{p(h \mid u_s)}[u_s h^{T}] - \mathbb{E}_{p(u, h)}[u h^{T}] \\
\frac{\partial}{\partial a} \ell(u_s; a) &= \mathbb{E}_{p(h \mid u_s)}[h] - \mathbb{E}_{p(u, h)}[h] \\
\frac{\partial}{\partial b} \ell(u_s; b) &= \mathbb{E}_{p(h \mid u_s)}[u_s] - \mathbb{E}_{p(u, h)}[u]
\end{aligned}
\tag{16}
\]

The conditional independence property of the RBM ensures an easy estimation of the data-dependent expectation. On the other hand, the model-dependent expectation involves a sum over all \(2^{K}\) configurations of \(u\) as well as the \(2^{T}\) configurations of \(h\). Therefore, exact computation of the model-dependent expectation is intractable because its complexity is exponential in the number of visible and hidden layer nodes. To avoid this computational burden, the model-dependent expectation can be approximated by generating a finite number of samples from the joint distribution \(p(u, h)\) using the Markov Chain Monte Carlo (MCMC) [24] technique.

The classical MCMC approach makes use of Gibbs sampling [18] to generate samples from a joint distribution of multiple random variables. The basic idea is to construct a Markov chain by updating each random variable based on its conditional distribution, given the state of the others. That is, to get a sample from a joint distribution \(p(y_1, y_2, \ldots, y_c)\) of \(c\) random variables, Gibbs sampling performs a sequence of \(r\) sampling steps of the form \(y_i \sim P(y_i \mid y_{-i})\), where \(y_{-i}\) represents the ensemble of the \((c - 1)\) random variables other than \(y_i\). Since an RBM consists of conditionally independent visible and hidden units, Gibbs sampling can be easily applied to get samples from the joint distribution \(p(u, h)\).
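Because the units within a layer are conditionally independent given the other layer, a whole layer can be resampled in one pass. The following sketch of such a block Gibbs chain assumes the standard sigmoid conditionals of a binary RBM, \(p(h_j = 1 \mid u) = \sigma(a_j + \sum_i w_{ij} u_i)\) and \(p(u_i = 1 \mid h) = \sigma(b_i + \sum_j w_{ij} h_j)\). The function names and toy parameters are hypothetical.

```python
import math
import random

def sigmoid(y):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-y))

def sample_hidden(u, W, a, rng):
    """Sample all hidden units at once given the visible vector u."""
    probs = [sigmoid(a[j] + sum(W[i][j] * u[i] for i in range(len(u))))
             for j in range(len(a))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def sample_visible(h, W, b, rng):
    """Sample all visible units at once given the hidden vector h."""
    probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def gibbs_chain(u0, W, b, a, steps, rng):
    """Run `steps` alternating hidden/visible updates starting from u0."""
    u, h = list(u0), []
    for _ in range(steps):
        h, _ = sample_hidden(u, W, a, rng)   # h(t) ~ p(h | u(t-1))
        u, _ = sample_visible(h, W, b, rng)  # u(t) ~ p(u | h(t))
    return u, h

# Hypothetical toy parameters: K = 3 visible units, T = 2 hidden units.
W = [[1.5, -0.5], [0.2, 0.8], [-1.0, 0.3]]
b = [0.1, -0.2, 0.0]
a = [-0.3, 0.4]

rng = random.Random(0)  # fixed seed so the run is repeatable
u, h = gibbs_chain([1, 0, 1], W, b, a, steps=100, rng=rng)
print("visible sample:", u, "hidden sample:", h)
```

Each step costs only two matrix-vector products, but many steps may be needed before the samples are representative of \(p(u, h)\).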
In an RBM, all hidden layer units are sampled simultaneously given fixed values for the variables in the visible layer. Similarly, visible layer variables are sampled simultaneously given the hidden layer variables. Thus, step \((t)\) of the Gibbs sampling process for the RBM defined in Eq. (2) has the following two phases:

\[ h_j^{(t)} \sim p(h_j \mid u^{(t-1)}), \qquad u_i^{(t)} \sim p(u_i \mid h^{(t)}) \tag{17} \]

where \(h^{(t)}\) and \(u^{(t-1)}\) refer to the set of all hidden and visible layer units at steps \((t)\) and \((t-1)\) of the Gibbs sampling procedure. Similarly, \(h_j^{(t)}\) and \(u_i^{(t)}\) are the j-th hidden layer unit and the i-th visible layer unit of the model at step \((t)\) of the Gibbs sampling procedure. As \(t \to \infty\), Gibbs sampling is guaranteed to generate accurate samples of \(p(u,h)\). Once a sufficient number of samples are obtained with Gibbs sampling, the Monte Carlo approach can be used to approximate the model-dependent expectation specified in Eq. (14). Let \(\{(u_1,h_1), (u_2,h_2), \ldots, (u_n,h_n)\}\) be a set of samples drawn from \(p(u,h)\) using the above-mentioned Gibbs sampling process. Then the Monte Carlo approximation of \(\mathbb{E}_{p(u,h)}\big[\frac{\partial}{\partial\Theta}E(u,h)\big]\) is given by:

\[ \mathbb{E}_{p(u,h)}\Big[\frac{\partial}{\partial\Theta}E(u,h)\Big] \approx \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\Theta}E(u_i,h_i) \tag{18} \]

Consequently, the derivative of the log-likelihood for the given training sample \(u_s\) can be approximated by:

\[ \frac{\partial}{\partial\Theta}\ell(u_s;\Theta) \approx -\sum_{h} p(h \mid u_s)\frac{\partial}{\partial\Theta}E(u_s,h) + \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\Theta}E(u_i,h_i) \tag{19} \]

However, obtaining unbiased samples from the RBM distribution using the MCMC method typically requires many sampling steps. As a result, the computation of the log-likelihood gradient remains intractable for large-scale image data sets. It has been shown that running the Markov chain for just a few steps is sufficient for estimating the log-likelihood gradient specified in Eq. (19). This leads to the Contrastive Divergence (CD) algorithm [20], which is now the most commonly used method for RBM training. Instead of waiting for the Gibbs chain to converge, the k-step Contrastive Divergence (\(\mathrm{CD}_k\)) algorithm runs the chain for only \(k\) steps.
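The two-phase block Gibbs step and the truncated k-step chain can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual sigmoid conditionals of a binary RBM with weights `W`, hidden bias `a` and visible bias `b`; the helper names are hypothetical.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def hidden_probs(u, W, a):
    """p(h_j = 1 | u) for every hidden unit j (assumed sigmoid conditional)."""
    return [sigmoid(a[j] + sum(u[i] * W[i][j] for i in range(len(u))))
            for j in range(len(a))]


def visible_probs(h, W, b):
    """p(u_i = 1 | h) for every visible unit i (assumed sigmoid conditional)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample_bernoulli(probs, rng):
    """Draw one binary value per probability; all units are sampled in one block."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_chain(u0, W, a, b, k, rng):
    """Run k block Gibbs steps starting from u0 and return the final visible sample."""
    u = list(u0)
    for _ in range(k):
        h = sample_bernoulli(hidden_probs(u, W, a), rng)   # h^(t) ~ p(h | u^(t-1))
        u = sample_bernoulli(visible_probs(h, W, b), rng)  # u^(t) ~ p(u | h^(t))
    return u
```

Starting the chain at a training vector and stopping after a small `k` gives the sample used by the k-step procedure, in contrast to running the chain until it mixes.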
That is, the chain starts from an input observation \(u_s\) of the training set (i.e., \(u^{(0)} = u_s\)) and yields the sample \(u^{(k)}\) by performing \(k\) steps of Gibbs sampling. Each step \(t\) of \(\mathrm{CD}_k\) consists of sampling \(h^{(t)}\) from \(p(h \mid u^{(t-1)})\) and then sampling \(u^{(t)}\) from \(p(u \mid h^{(t)})\). Finally, the gradient in Eq. (19) can be written as:

\[ \mathrm{CD}_k(\Theta, u^{(0)} = u_s) = -\sum_{h} p(h \mid u^{(0)})\frac{\partial}{\partial\Theta}E(u^{(0)},h) + \sum_{h} p(h \mid u^{(k)})\frac{\partial}{\partial\Theta}E(u^{(k)},h) \tag{20} \]

Hinton et al. [20] empirically found that the learning algorithm converges close to the exact maximum-likelihood solution even for small values of \(k\) (often just one step). A batch-based version of \(\mathrm{CD}_k\) is presented in Algorithm 1. In the batch-based training protocol, all input observations are presented to the model before the parameter update takes place. The algorithm makes several passes (epochs) through the training data so as to get a final estimate of the parameter vector \(\Theta\). For an input observation \(u_s\) of the training set (i.e., \(u^{(0)} = u_s\)), the following rules are used by the k-step Contrastive Divergence algorithm to update the weights and biases of the model:

\[
\begin{aligned}
\Delta w_{ij} &= \Delta w_{ij} + \eta\Big(p(h_j = 1 \mid u^{(0)})\,u_i^{(0)} - p(h_j = 1 \mid u^{(k)})\,u_i^{(k)}\Big) \\
\Delta b_i &= \Delta b_i + \eta\Big(u_i^{(0)} - u_i^{(k)}\Big) \\
\Delta a_j &= \Delta a_j + \eta\Big(p(h_j = 1 \mid u^{(0)}) - p(h_j = 1 \mid u^{(k)})\Big)
\end{aligned}
\tag{21}
\]

where \(\eta > 0\) is the learning rate of RBM. Once the unknown parameters are estimated, RBM generates a T-dimensional latent topic-based representation \(p(h \mid u_{new})\) for an unseen input \(u_{new}\) supplied to the model. The newly generated feature vector provides a quantitative description of the latent topic structure associated with the unseen input \(u_{new}\). Moreover, the dimensionality of the obtained representation is considerably lower than that of the actual input. All these characteristics make RBM an ideal tool for latent topic-based image modeling.

3.2 Replicated Softmax Model

As described in the previous section, RBMs only deal with input observations from a Bernoulli distribution. While modeling an image characterized by a visual dictionary, we are interested in the occurrence frequency of visual words in the given image. However, visual word-count vectors cannot be modeled by RBMs with binary-valued (Bernoulli) input units. Therefore, Salakhutdinov and Hinton [6] proposed the Replicated Softmax Model (RSM) as a variant of RBM to model word-count data. The nodes in the visible layer are modeled as Softmax units and can have one of many different states. A graphical representation of the RSM framework is depicted in Fig. 3a. Let \(K\) be the size of the visual dictionary learned from a set of training images and \(N\) be the number of interest points detected in the given image. Then the input data to the RSM model is an \(N \times K\) binary matrix \(U\) with \(U_{ik} = 1\) if and only if the i-th interest point in the given image is assigned to the k-th visual word, and is given by:

\[ U_{ik} = \begin{cases} 1, & \text{if the i-th interest point is assigned to the k-th visual word} \\ 0, & \text{otherwise} \end{cases} \tag{22} \]

Let \(h \in \{0,1\}^{T}\) be the binary stochastic latent topic feature. Then the energy of the RSM model for the configuration \(\{U, h\}\) is defined as:

\[ E(U,h;\Theta) = -\sum_{n=1}^{N}\sum_{j=1}^{T}\sum_{i=1}^{K} W_{ijn}\,h_j\,U_{ni} - \sum_{n=1}^{N}\sum_{i=1}^{K} U_{ni}\,b_{ni} - \sum_{j=1}^{T} h_j a_j \tag{23} \]

where \(\Theta = [W, a, b]\) are the model parameters, in which \(W = [W_{ijn}]\) denotes the connection strength between the i-th visible layer unit corresponding to the n-th interest point in the given image and the j-th hidden layer unit, \(b = [b_{ni}]\) is the bias associated with the i-th visible unit of the n-th interest point in the given image, and \(a\) is the bias of the hidden layer \(h\).

The concept of weight sharing simplifies the basic formulation of RSM specified in Eq. (23). Weight sharing ignores the sequence in which local descriptors occur in the input image. That is, if the i-th visible unit of the n-th local image descriptor is forced to share its weight with the i-th visible unit of all other local descriptors, then \(W_{ijn}\) can be simply redefined as \(W_{ij}\). This procedure is illustrated in Fig. 3b. With this modification, the input binary matrix \(U\) of the RSM framework can be replaced with \(K\) visible layer nodes \(U = [u_1, u_2, \ldots, u_K]\), each of which corresponds to a distinct visual word in the learned dictionary.
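The effect of weight sharing can be illustrated with a small Python sketch: once every interest point uses the same weights, the interaction term computed from the one-hot assignment matrix equals the one computed from per-word counts (the column sums of `U`). The function names and the toy numbers are illustrative assumptions only.

```python
def one_hot_matrix(assignments, K):
    """N x K binary matrix: row n has a 1 in the column of the word assigned to point n."""
    return [[1 if word == i else 0 for i in range(K)] for word in assignments]


def word_counts(U):
    """Column sums of U, i.e. how often each visual word occurs in the image."""
    K = len(U[0])
    return [sum(row[i] for row in U) for i in range(K)]


def interaction_from_matrix(U, W, h):
    """sum_n sum_j sum_i W[i][j] * h[j] * U[n][i] with weights shared across points n."""
    return sum(W[i][j] * h[j] * row[i]
               for row in U for i in range(len(W)) for j in range(len(h)))


def interaction_from_counts(counts, W, h):
    """sum_j sum_i W[i][j] * h[j] * counts[i], the same quantity after weight sharing."""
    return sum(W[i][j] * h[j] * counts[i]
               for i in range(len(W)) for j in range(len(h)))
```

Because only the counts matter, images with different numbers of interest points can be fed to the same K visible nodes.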
The nodes in the visible layer \(U\) are shown using concentric circles to indicate replication, i.e., the number of times each visual word occurs in the given image. The weight sharing operation brings down the total number of weights to be learned from \((N \times T \times K)\) to \((T \times K)\), and it helps RSM to model images with a varying number of visual words. The energy of the configuration \(\{U, h\}\) after weight sharing is then defined as:

\[ E(U,h;\Theta) = -\sum_{j=1}^{T}\sum_{i=1}^{K} W_{ij}\,h_j\,\hat{u}_i - \sum_{i=1}^{K} \hat{u}_i b_i - N\sum_{j=1}^{T} h_j a_j \tag{24} \]

where \(\hat{u}_i = \sum_{n=1}^{N} U_{ni}\) denotes the frequency with which the i-th visual word appears in the given image. It should be noted that the bias term for the hidden units is scaled by the number of interest points \(N\). This scaling is crucial as it allows hidden units to behave sensibly when dealing with inputs of different lengths, i.e., images with different numbers of interest points. Then, the probability that the model assigns to a visible binary matrix \(U\) is given by:

\[ p(U;\Theta) = \frac{1}{Z(\Theta)}\sum_{h}\exp(-E(U,h;\Theta)) \tag{25} \]

where \(Z(\Theta)\) is known as the partition function and is defined as:

\[ Z(\Theta) = \sum_{U}\sum_{h}\exp(-E(U,h;\Theta)) \tag{26} \]

The conditional probabilities of visual words and latent topics are expressed in terms of Softmax and logistic sigmoid functions defined as follows:

\[ p(u_i = 1 \mid h) = \frac{\exp\big(b_i + \sum_{j=1}^{T} W_{ij}h_j\big)}{\sum_{k=1}^{K}\exp\big(b_k + \sum_{j=1}^{T} W_{kj}h_j\big)} \tag{27} \]

\[ p(h_j = 1 \mid u) = \sigma\Big(N a_j + \sum_{i=1}^{K}\hat{u}_i W_{ij}\Big) \tag{28} \]

The major advantage of using Softmax units in RSM is that the principle behind parameter estimation remains the same as that of RBM. Thus, the weights and biases of RSM are optimized by applying the contrastive divergence algorithm to the log-likelihood gradient. By following the same conventions as used in RBM, the update rules for the model parameters of RSM can be derived as follows:

\[
\begin{aligned}
\Delta W_{ij} &= \Delta W_{ij} + \eta\Big(p(h_j = 1 \mid U^{(0)})\,\hat{u}_i^{(0)} - p(h_j = 1 \mid U^{(k)})\,\hat{u}_i^{(k)}\Big) \\
\Delta b_i &= \Delta b_i + \eta\Big(\hat{u}_i^{(0)} - \hat{u}_i^{(k)}\Big) \\
\Delta a_j &= \Delta a_j + \eta\Big(p(h_j = 1 \mid U^{(0)}) - p(h_j = 1 \mid U^{(k)})\Big)
\end{aligned}
\tag{29}
\]

where \(U^{(0)} = [u_1^{(0)}, u_2^{(0)}, \ldots, u_K^{(0)}]\) is an input observation from the training set from which the Gibbs chain starts and \(U^{(k)}\) is the resulting sample after performing \(k\) steps of Gibbs sampling.

3.3 Deep Boltzmann Machine

Similar to RBM, a Deep Boltzmann Machine (DBM) [23] is also an energy-based, undirected graphical model. It is a composite of a single visible layer and multiple hidden layers. It can be viewed as a number of RBMs that are stacked on top of each other. The detailed architecture of a Deep Boltzmann Machine with \(L\) hidden layers is shown in Fig. 4.

Fig. 4 Graphical representation of deep Boltzmann machine with L hidden layers [23]

Connections exist only between units in adjacent hidden layers and between units in the visible layer and the first hidden layer. Because of the deep hierarchical structure, DBM has greater flexibility and good representation power while modeling complex data distributions. That is, DBM can generate more structured and abstract representations of input observations. Consider a Deep Boltzmann Machine with one input layer \(u = \{u_1, u_2, \ldots, u_K\} \in \{0,1\}^{K}\) and a series of \(L\) hidden layers \(h = \{h^{(1)} \in \{0,1\}^{T_1}, h^{(2)} \in \{0,1\}^{T_2}, \ldots, h^{(L)} \in \{0,1\}^{T_L}\}\). Then, the energy of the joint configuration \(\{u, h\}\) is defined as:

\[ E(u,h;\Theta) = -\sum_{i=1}^{K}\sum_{j=1}^{T_1} u_i\,h_j^{(1)} w_{ij}^{(1)} - \sum_{i=1}^{K} b_i u_i - \sum_{j=1}^{T_1} a_j^{(1)} h_j^{(1)} - \sum_{\ell=2}^{L}\Big[\sum_{j=1}^{T_{\ell-1}}\sum_{k=1}^{T_\ell} h_j^{(\ell-1)} h_k^{(\ell)} w_{jk}^{(\ell)} + \sum_{j=1}^{T_\ell} h_j^{(\ell)} a_j^{(\ell)}\Big] \tag{30} \]

where \(h^{(\ell)} = [h_1^{(\ell)}, h_2^{(\ell)}, \ldots, h_{T_\ell}^{(\ell)}]\) denotes the \(\ell\)-th hidden layer of the DBM, which contains \(T_\ell\) binary-valued hidden units. \(W^{(1)} = [w_{ij}^{(1)}]\) represents the weights between nodes in the visible layer and the nodes in the first hidden layer \(h^{(1)}\). \(b_i\) is the bias term associated with the i-th visible layer node \(u_i\). \(W^{(\ell)} = [w_{jk}^{(\ell)}]\), where \(2 \le \ell \le L\), is the weight between the j-th node in the hidden layer \(h^{(\ell-1)}\) and the k-th node in the hidden layer \(h^{(\ell)}\). \(a_j^{(\ell)}\) are the bias terms associated with the j-th node in the hidden layer \(h^{(\ell)}\).
All these model parameters are represented by the vector \(\Theta\). The probability that the model assigns to a visible vector \(u\) is then given by the Boltzmann distribution of the following form:

\[ p(u;\Theta) = \frac{1}{Z(\Theta)}\sum_{h}\exp(-E(u,h;\Theta)) \tag{31} \]

Based on the above formulation, the conditional distribution of each hidden layer \(\ell\), where \(2 \le \ell < L\), of the DBM can be expressed as:

\[ p(h_j^{(\ell)} = 1 \mid h^{(\ell-1)}, h^{(\ell+1)}) = \sigma\Big(\sum_{k=1}^{T_{\ell-1}} h_k^{(\ell-1)} w_{kj}^{(\ell)} + \sum_{i=1}^{T_{\ell+1}} w_{ji}^{(\ell+1)} h_i^{(\ell+1)} + a_j^{(\ell)}\Big) \tag{32} \]

The conditional distribution over the last hidden layer \(h^{(L)}\) is defined as:

\[ p(h_j^{(L)} = 1 \mid h^{(L-1)}) = \sigma\Big(\sum_{k=1}^{T_{L-1}} w_{kj}^{(L)} h_k^{(L-1)} + a_j^{(L)}\Big) \tag{33} \]

Similarly, the conditional distributions of the visible layer \(u\) and the first hidden layer \(h^{(1)}\) are given by:

\[ p(u_i = 1 \mid h^{(1)}) = \sigma\Big(\sum_{j=1}^{T_1} w_{ij}^{(1)} h_j^{(1)} + b_i\Big) \tag{34} \]

\[ p(h_j^{(1)} = 1 \mid u, h^{(2)}) = \sigma\Big(\sum_{k=1}^{T_2} w_{jk}^{(2)} h_k^{(2)} + \sum_{i=1}^{K} w_{ij}^{(1)} u_i + a_j^{(1)}\Big) \tag{35} \]

where \(\sigma(\cdot)\) is the logistic sigmoid function defined as \(\sigma(y) = 1/(1 + \exp(-y))\).

The previously mentioned maximum-likelihood learning procedure is also applicable to estimate the model parameters of DBM. However, it should be noted that this algorithm is rather slow, especially for deep architectures with multiple layers of hidden units where the upper layers are quite remote from the visible layer. This limitation can be effectively resolved using the greedy layer-wise learning strategy [25], which is briefly reviewed in the following subsection. This layer-wise training strategy is extended by the proposed HDLA to learn the model parameters.

3.3.1 The Layer-Wise Training Strategy for DBM

Parameter learning in DBM is performed using an unsupervised layer-wise training procedure. In this approach, the layers of DBM are grouped pairwise to form a sequence of RBMs. Then, the RBMs in the stack are trained independently in a bottom-up fashion such that successive RBMs use the samples drawn from the joint distribution of the visible and hidden layers of the previous RBM in the hierarchy as their input data.
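The bottom-up control flow of this greedy scheme can be sketched as below. This is only an outline under stated assumptions: `train_rbm` is a hypothetical caller-supplied routine (for example a contrastive divergence trainer) that returns a weight matrix and hidden bias, the hidden units use a plain sigmoid conditional, and the structural modifications needed for a DBM are not modeled.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def sample_hidden(x, W, a, rng):
    """Sample a binary hidden vector given input x (assumed sigmoid conditional)."""
    probs = [sigmoid(a[j] + sum(x[i] * W[i][j] for i in range(len(x))))
             for j in range(len(a))]
    return [1 if rng.random() < p else 0 for p in probs]


def greedy_layerwise(data, hidden_sizes, train_rbm, rng):
    """Train a stack of RBMs one at a time, bottom-up.

    train_rbm(inputs, n_hidden) -> (W, a) is a hypothetical hook supplied by the caller.
    Each trained layer is frozen and its sampled hidden states feed the next layer.
    """
    stack = []
    inputs = data
    for n_hidden in hidden_sizes:
        W, a = train_rbm(inputs, n_hidden)
        stack.append((W, a))
        inputs = [sample_hidden(x, W, a, rng) for x in inputs]
    return stack
```

Passing the trainer in as a function keeps the sketch independent of any particular learning rule; only the "train, freeze, propagate upward" pattern is shown.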
The entire learning procedure for a DBM with \(L\) hidden layers is summarized in Algorithm 2. In the layer-by-layer training procedure, the first RBM in the hierarchy is trained to model the given input observations. That is, the visible layer \(u\) of the first RBM accepts the input observations and models them using the k-step contrastive divergence algorithm. After training the first RBM, a sufficiently large number of samples are generated from the joint distribution \(p(u \mid h)\) as the input data for the next RBM in the hierarchy (step 3 of Algorithm 2). While training the remaining portion of the DBM, only two layers \(h^{(\ell-1)}\) and \(h^{(\ell)}\) of the network are considered at a time, with the assumption that \(h^{(\ell-1)}\) is known and fixed. Then, the joint distribution \(p(h^{(\ell-1)}, h^{(\ell)})\) of these two layers is approximated as if they constitute an isolated Restricted Boltzmann Machine, and its parameters are learned by maximizing the likelihood \(p(h^{(\ell-1)})\). The k-step contrastive divergence learning procedure mentioned in Algorithm 1 is used for this purpose.

Since all the edges are undirected, every hidden layer node, except those in the last hidden layer of the DBM, accepts signals from both the upper and the lower layer nodes, as indicated in Eq. (32). Hence, the training algorithm must account for the top-down and the bottom-up interaction terms while learning the parameters of DBM. With this objective, Salakhutdinov and Hinton [25] modified the structure of the RBMs in the entire stack before the actual training begins. For instance, the following changes are made to the structure of the RBMs while training a DBM with three hidden layers, as shown in Fig. 5b. Initially, the first layer RBM is altered to have two copies of visible layer nodes along with tied weights. The newly added visible layer nodes compensate for the lack of top-down interaction terms from the second layer.
Similarly, the structure of the third layer RBM is modified in such a way that it involves two copies of hidden layer units \(h^{(3)}\) and the respective weight matrix \(W^{(3)}\) to compensate for the lack of bottom-up interactions from RBM-2. For the intermediate layer, the RBM is restructured such that only the connection strengths \(W^{(2)}\) are doubled. Salakhutdinov and Hinton [25] were able to show that the layer-wise training of DBM with this type of structural modification is guaranteed to yield optimal values for the model parameters.

Fig. 5 Greedy learning strategy for DBM with three hidden layers [25]

4 The Proposed Image Retrieval Framework

The proposed HDLA model for latent topic-based image retrieval involves two main processing steps. The first step is fitting the HDLA model to the full set of training images. In this step, the parameters of the HDLA model are learned from the training images, and it proceeds in three stages: (i) visual dictionary learning, (ii) generating the Bag of Visual Words (BoVW) representation of the training images and (iii) layer-by-layer training of the HDLA model in an unsupervised fashion. The second processing step is testing the learned HDLA model and thereby inferring the latent topic-based representation of previously unseen images for the task of CBIR.

To obtain the visual dictionary, each image in the training collection is decomposed into non-overlapping, fixed-size local image blocks. Then, scattering transform coefficients [26] are extracted from all these local image patches to form the feature space. Finally, the local image feature space is quantized into a predefined number (\(K\)) of clusters using the K-means algorithm. Each of the resulting cluster centers is termed a visual word, and the set of all visual words thus obtained is termed a visual dictionary.
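Given such a dictionary, the quantization and counting step that produces a BoVW vector can be sketched as follows. The sketch assumes descriptors and visual words are plain numeric sequences compared by squared Euclidean distance; the function names are illustrative, and descriptor extraction and K-means clustering are outside its scope.

```python
def nearest_word(descriptor, dictionary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(dictionary):
        dist = sum((x - y) ** 2 for x, y in zip(descriptor, word))
        if dist < best_dist:  # ties keep the earlier word
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, dictionary):
    """K-dimensional count vector: occurrences of each visual word in one image."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, dictionary)] += 1
    return counts
```

The entries of the resulting vector sum to the number of local patches in the image, which is the quantity N that scales the hidden bias in the Replicated Softmax Model.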
The BoVW representation of the images in the training collection is generated by decomposing each image into local patches, which are then represented by means of scattering transform coefficients. The local image descriptors thus obtained are then mapped to the nearest visual word in the initially constructed visual dictionary. Finally, the number of occurrences of each visual word over the entire image is counted to form a K-dimensional feature vector known as the BoVW representation.

The HDLA model has a layered hierarchical structure where the processing elements are called nodes. There is one layer of visible nodes and multiple layers of hidden nodes stacked on top of one another to constitute the HDLA model. The nodes of any two adjacent layers are bidirectionally connected through weights, which serve as the model parameters. Each layer of the HDLA model generates an activation probability conditioned on the corresponding inputs, and it mainly depends on the model weights. As the visible layer accepts the visual word counts in the form of the BoVW representation of training images, the lowest level in the HDLA model is an RSM with additional constraints on its weights and activation probabilities. The upper hidden layers of the HDLA model are paired together to form a hierarchy of Restricted Boltzmann Machines. The hidden layer nodes in HDLA capture the higher-order correlation among visual words and thereby group semantically identical visual words together to form latent topics. The output of the topmost hidden layer is the latent topic distribution of the given image and is employed for the task of image retrieval.

We use a greedy layer-wise training strategy to learn the parameters of the proposed HDLA model, which leads to iterative update rules for the parameters of individual layers. The basic idea of the layer-wise training strategy is to train the HDLA model one layer at a time, starting from the first layer.
The principle of maximum likelihood is employed to learn the parameters of each individual layer in the HDLA model. Thus, for a given collection of training images, the parameters of individual layers are learned in such a way that the highest possible probability is assigned to the given training data.

Given a previously unseen image (\(I_{test}\)) in the testing phase of the proposed HDLA model, its BoVW representation (\(v_{test}\)) is obtained based on the initially created visual dictionary, and this BoVW representation is then presented as input to the visible layer of the HDLA model. The latent topic distribution of the test image is then computed as the activation probability \(p(h^{(L)} \mid v_{test})\) of the topmost hidden layer in the HDLA model conditioned on the BoVW representation of the given test image. A ranked list of database images is then prepared on the basis of these latent topic features. Figure 6 shows graphically the process for both training and testing the proposed HDLA for the task of image retrieval. The rest of this section provides the implementation details of the proposed HDLA model.

4.1 The Hybrid Deep Learning Architecture

As mentioned earlier, the latent topic representation obtained with a Deep Boltzmann Machine-based architecture possesses good generalization ability. A Deep Boltzmann Machine has multiple layers of processing modules stacked on top of one another, and each unsupervised module in this hierarchy is provided with the representation vectors from the lower level module. Thus, the latent topic vectors in the upper layers capture the high-level dependencies among input variables and thereby improve the ability of the system to learn complex distributions present in the input data. However, the fully distributed representation yielded by DBM often fails to capture the constituent parts or factors of the input observations. In other words, the high-level abstraction generated by DBM often lacks the inherent meaning of adding parts to form a whole.
In fact, a "part-based" representation [27] ensures non-subtractive combinations of components to form the given input. Therefore, restricting the network weights of DBM to nonnegative values yields a "part-based" representation of input data and possibly enhances the expressive power of the basic DBM model. Another possibility for improving the performance of DBM is the incorporation of sparsity into the learned representation. In sparse feature coding [28], the final representation is forced to have only a few non-zero components, and most of the remaining entries are zero. Hence, sparsity is an effective constraint for performance enhancement when there is no prior indication of the number of hidden layers in DBM, or of the number of hidden units in successive layers, needed to create an optimal deep network that efficiently discovers interesting structures embedded in the input data.

Considering the above-mentioned factors, this paper proposes a Hybrid Deep Learning Architecture (HDLA) which uses a Constrained Replicated Softmax Model (CRSM) in the lowest level together with Constrained Restricted Boltzmann Machines (CRBMs) in the higher layers. The proposed architecture integrates a quadratic barrier function [29] into the objective function of both CRSM and CRBM so that learning is skewed toward nonnegative weights. With this formulation, the contribution of lower layer units toward each unit in the next higher layer becomes additive in nature. In addition to this, an \(\ell_1\)-regularization term is also added to the objective functions of RSM and RBM to enforce sparseness of the final representation. The basic architecture of the proposed model is shown in Fig. 7.
The following subsections provide a detailed description of the Constrained Replicated Softmax Model (CRSM) and the Constrained Restricted Boltzmann Machine (CRBM), which together form the proposed HDLA model for inferring latent topic-based image representations suitable for retrieval.

4.1.1 Constrained Replicated Softmax Model

This section presents a modified version of the Replicated Softmax Model (RSM) named CRSM, which serves as the base-level processing module in the proposed HDLA model. Let \(U = (u_1, u_2, \ldots, u_K) \in \{1, 2, \ldots, P\}^{K}\) denote the set of visible variables and \(h^{(1)} = (h_1^{(1)}, h_2^{(1)}, \ldots, h_{T_1}^{(1)}) \in \{0,1\}^{T_1}\) indicate the set of hidden nodes of CRSM. The input to the visible units of CRSM is the visual word-count vectors, and to learn an optimum fitting distribution for any given set of \(m\) data samples \(\{U_1, U_2, \ldots, U_m\}\), CRSM attempts to solve the following minimization problem:

\[ \min_{\Theta_1} J_1(\Theta_1) = -\sum_{s=1}^{m}\ln\big[p(U_s;\Theta_1)\big] + \beta_1\sum_{i=1}^{K}\sum_{j=1}^{T_1} f(w_{ij}) + \gamma_1\sum_{s=1}^{m} f\big(p(h^{(1)} \mid U_s)\big) \tag{36} \]

where \(\Theta_1 = [W, a, b]\) is the model parameter vector in which \(a = [a_1, a_2, \ldots, a_{T_1}]\) and \(b = [b_1, b_2, \ldots, b_K]\) represent the biases of the hidden layer \(h^{(1)}\) and the visible layer \(U\), respectively, and \(W = [w_{ij}]\) denotes the weight between the i-th visible layer node and the j-th hidden layer unit. \(\ln[p(U_s;\Theta_1)]\) is the log-likelihood of the training sample \(U_s\) and is computed by taking the logarithm of the probability value defined in Eq. (25). \(f(w_{ij})\) is the quadratic barrier function which enforces the nonnegativity restriction on the model weights, and \(f(p(h^{(1)} \mid U_s))\) is the \(\ell_1\)-regularization term which is used to enforce sparsity on the latent topic representation learned by CRSM. \(\beta_1\) and \(\gamma_1\) are the weight penalty term and the sparse hyper-parameter of CRSM. They, respectively, control the level of nonnegativity of the connection weight matrix \(W\) and the sparsity of the hidden layer activation \(p(h^{(1)} \mid U_s)\). The quadratic barrier function is defined as follows:

\[ f(w_{ij}) = \begin{cases} w_{ij}^{2}, & w_{ij} < 0 \\ 0, & w_{ij} \ge 0 \end{cases} \tag{37} \]

The sparse regularization term which makes the hidden activation of CRSM sparse is written as:

\[ f\big(p(h^{(1)} \mid U_s)\big) = \sum_{k=1}^{T_1}\big|\,p(h_k^{(1)} = 1 \mid U_s)\big| \tag{38} \]

Thus, the objective function is the sum of a negative log-likelihood term and two regularization terms. To estimate the model parameters, the stochastic gradient descent procedure is used. Then, the derivative of Eq. (36) with respect to the model parameter \(\Theta_1\) for a given sample \(U_s\) consists of three terms as shown below:

\[ \frac{\partial}{\partial\Theta_1}J_1(U_s;\Theta_1) = -\frac{\partial}{\partial\Theta_1}\Big[\ln p(U_s;\Theta_1)\Big] + \beta_1\frac{\partial}{\partial\Theta_1}\Big[\sum_{i=1}^{K}\sum_{j=1}^{T_1} f(w_{ij})\Big] + \gamma_1\frac{\partial}{\partial\Theta_1}\Big[f\big(p(h^{(1)} \mid U_s)\big)\Big] \tag{39} \]

In fact, the contrastive divergence learning procedure provides an efficient approximation to the gradient of the log-likelihood term present in Eq. (39). Hence, on every iteration, the contrastive divergence algorithm is applied, followed by one step of gradient descent using the derivative of the two regularization terms. Thus, for an input observation \(U_s\) of the training set (i.e., \(U^{0} = U_s\)), the parameters of CRSM are updated as follows:

\[ w_{ij} = w_{ij} + \eta\Big(p(h_j^{(1)} = 1 \mid U^{0})\,u_i^{0} - p(h_j^{(1)} = 1 \mid U^{k})\,u_i^{k} + \beta_1\lceil w_{ij}\rceil + \gamma_1\triangle w_{ij}\Big) \tag{40} \]

\[ a_j = a_j + \eta\Big(p(h_j^{(1)} = 1 \mid U^{0}) - p(h_j^{(1)} = 1 \mid U^{k}) + \gamma_1\triangle a_j\Big) \tag{41} \]

\[ b_i = b_i + \eta\Big(u_i^{0} - u_i^{k}\Big) \tag{42} \]

where the complete description of the terms \(\lceil w_{ij}\rceil\), \(\triangle w_{ij}\) and \(\triangle a_j\) is provided in "Appendix A".

4.1.2 Constrained Restricted Boltzmann Machine

The higher-level processing modules of the proposed HDLA formulation are termed Constrained Restricted Boltzmann Machines (CRBMs). There are \(L\) CRBM modules in the proposed HDLA model. This section explains the formulation of the \(\ell\)-th CRBM (i.e., CRBM-\(\ell\)) where \(1 \le \ell \le L\); the basic theory remains the same for all other CRBMs in the hierarchy. More formally, CRBM-\(\ell\) involves two sets of binary stochastic hidden layers \(h^{(\ell)} = (h_1^{(\ell)}, h_2^{(\ell)}, \ldots, h_{T_\ell}^{(\ell)})\) and \(h^{(\ell+1)} = (h_1^{(\ell+1)}, h_2^{(\ell+1)}, \ldots, h_{T_{\ell+1}}^{(\ell+1)})\). Then, CRBM-\(\ell\) can model any distribution on \(\{0,1\}^{T_\ell}\) by learning appropriate model parameter values that minimize the following optimization problem for a given set of \(m\) training samples \(\{h_1^{(\ell)}, h_2^{(\ell)}, \ldots, h_m^{(\ell)}\}\):

\[ \min_{\Theta_\ell} J_\ell(\Theta_\ell) = -\sum_{s=1}^{m}\ln\big[p(h_s^{(\ell)};\Theta_\ell)\big] + \beta_\ell\sum_{i=1}^{T_\ell}\sum_{j=1}^{T_{\ell+1}} f(w_{ij}^{(\ell)}) + \gamma_\ell\sum_{s=1}^{m} f\big(p(h^{(\ell+1)} \mid h_s^{(\ell)})\big) \tag{43} \]

where \(\Theta_\ell = [W^{(\ell)}, a^{(\ell)}]\) indicates the parameters of CRBM-\(\ell\), among which \(W^{(\ell)} = [w_{ij}^{(\ell)}]\) represents the interaction between the i-th unit in the hidden layer \(h^{(\ell)}\) and the j-th unit in the hidden layer \(h^{(\ell+1)}\), and \(a^{(\ell)}\) is the bias associated with the hidden layer units in \(h^{(\ell+1)}\). \(\ln[p(h_s^{(\ell)};\Theta_\ell)]\) is the log-likelihood of the given sample \(h_s^{(\ell)}\) and is expressed as the logarithm of the probability value defined in Eq. (25). \(f(w_{ij}^{(\ell)})\) is the quadratic barrier function that ensures the nonnegativity restriction on the network weights of CRBM-\(\ell\). \(f(p(h^{(\ell+1)} \mid h_s^{(\ell)}))\) is the \(\ell_1\)-regularization term for the sparse activation of the output hidden layer units of CRBM-\(\ell\). \(\beta_\ell\) and \(\gamma_\ell\) are the weight penalty term and the sparse hyper-parameter of CRBM-\(\ell\). These terms are defined in the same way as in the case of CRSM. The stochastic gradient descent procedure is then applied to estimate the parameters of CRBM-\(\ell\). The derivative of Eq. (43) with respect to the model parameters \(\Theta_\ell\) for a given input sample \(h_s^{(\ell)}\) is given by:

\[ \frac{\partial}{\partial\Theta_\ell}J_\ell(h_s^{(\ell)};\Theta_\ell) = -\frac{\partial}{\partial\Theta_\ell}\Big[\ln p(h_s^{(\ell)};\Theta_\ell)\Big] + \beta_\ell\frac{\partial}{\partial\Theta_\ell}\Big[\sum_{i=1}^{T_\ell}\sum_{j=1}^{T_{\ell+1}} f(w_{ij}^{(\ell)})\Big] + \gamma_\ell\frac{\partial}{\partial\Theta_\ell}\Big[f\big(p(h^{(\ell+1)} \mid h_s^{(\ell)})\big)\Big] \tag{44} \]

Similar to CRSM, the parameter estimation of CRBM-\(\ell\) is obtained by applying the contrastive divergence learning rule followed by a gradient descent step based on the derivative of the sparse regularization term and the nonnegativity constraint (refer to "Appendix B" for more details). Then, for an input sample \(h_s^{(\ell)}\) from the training set (i.e., \(h^{0} = h_s^{(\ell)}\)), the parameter update rules of CRBM-\(\ell\) become:

\[ w_{ij}^{(\ell)} = w_{ij}^{(\ell)} + \eta\Big(p(h_j^{(\ell+1)} = 1 \mid h^{0})\,h_i^{0} - p(h_j^{(\ell+1)} = 1 \mid h^{k})\,h_i^{k} + \beta_\ell\lceil w_{ij}^{(\ell)}\rceil + \gamma_\ell\nabla w_{ij}^{(\ell)}\Big) \tag{45} \]

\[ a_j^{(\ell)} = a_j^{(\ell)} + \eta\Big(p(h_j^{(\ell+1)} = 1 \mid h^{0}) - p(h_j^{(\ell+1)} = 1 \mid h^{k}) + \gamma_\ell\nabla a_j^{(\ell)}\Big) \tag{46} \]

where the complete description of the terms \(\lceil w_{ij}^{(\ell)}\rceil\), \(\nabla w_{ij}^{(\ell)}\) and \(\nabla a_j^{(\ell)}\) is provided in "Appendix B".

4.1.3 HDLA Model Training

The layer-wise learning procedure already mentioned in Algorithm 2 is extended to learn the parameters of the proposed HDLA model. By using the layer-wise strategy, the learning process of the proposed HDLA model is broken down into a number of related sub-tasks such that all of them can be completed in a stage-by-stage fashion. The basic idea here is to gradually present input observations to the HDLA model so that the coarse-scale properties of input observations are captured at the early stages of training, while the fine-scale characteristics are learned in later stages. After training each layer, its output is considered as the input for training the next layer. That is, the output of each layer serves as a prior for learning the parameters of the next higher layer.

The entire procedure for training the proposed HDLA model is summarized in Algorithm 3. Initially, the parameters of the CRSM module, which takes the BoVW representation of each training image as input, are optimized using the one-step contrastive divergence algorithm with the update rules specified in Eqs. (40)–(42). Then, the obtained parameters of CRSM are frozen and its hidden layer configuration for the given input observations is inferred. These inferred values then act as the input data for CRBM-1 in the next higher level of the hybrid deep learning architecture. Again, the one-step contrastive divergence algorithm with the value \(\ell = 1\) and the modified update rules specified in Eqs. (45) and (46) are used to derive the parameters of CRBM-1. This procedure is repeated until the parameters of CRBM-\(L\) in the hierarchy are learned. To account for the top-down and bottom-up interaction terms, the structure of the HDLA model is altered during training according to the strategy already illustrated in Sect. 3.3.1. Finally, these parameters are composed together to form the required HDLA model.
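The two regularizers that enter the CRSM and CRBM objectives can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the barrier follows the piecewise definition in Eq. (37), the sparsity penalty is taken as the sum of absolute hidden activations, and `barrier_derivative` is just the elementary derivative of that piecewise function, not the appendix terms used in the paper's update rules.

```python
def quadratic_barrier(w):
    """f(w) = w^2 for negative weights, 0 otherwise."""
    return w * w if w < 0 else 0.0


def barrier_derivative(w):
    """Elementary derivative of the barrier: 2w for w < 0, else 0 (illustrative only)."""
    return 2.0 * w if w < 0 else 0.0


def nonnegativity_penalty(W):
    """Sum of the barrier over all entries of a weight matrix (list of rows)."""
    return sum(quadratic_barrier(w) for row in W for w in row)


def l1_sparsity(activations):
    """l1 norm of a vector of hidden activation probabilities."""
    return sum(abs(p) for p in activations)
```

Only negative weights contribute to the first penalty, so a weight matrix that is already nonnegative incurs no cost, while the second penalty shrinks as more hidden activations approach zero.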
5 HDLA-Based Image Representation

This section describes how to obtain a latent topic-based representation suitable for image retrieval from the trained HDLA model. Furthermore, the distance metric used to estimate the semantic similarity between images is also discussed.

5.1 Encoding of Previously Unseen Images

Once the model parameters of HDLA are learned from an appropriate set of training samples, the given query and the database images can be mapped into the latent topic space for the purpose of image retrieval. The HDLA model with \(L\) hidden layers generates a latent topic-based representation \(p(h^{(L)} \mid v_{test})\) for every input image whose BoVW representation is \(v_{test}\). The activation \(p(h^{(L)} \mid v_{test})\) of the topmost hidden layer of HDLA denotes the latent topic structure of the given image and is taken as the feature vector for the desired retrieval operation.

5.2 Image Similarity Measure

To use the features generated by the proposed hybrid deep learning architecture for image retrieval, an appropriate similarity measure has to be defined which efficiently estimates the correspondence between images characterized by their latent topic distributions. In recent years, deep learning-based models for document classification and retrieval have used Jensen–Shannon (JS) divergence as the similarity metric and have reported good classification and retrieval accuracy [21]. This motivates the use of JS divergence as the similarity metric in the proposed work. Given the query image \(I_q\) and the database image \(I_d\), let the K-dimensional latent topic-based representations obtained with the proposed HDLA model be denoted by \(f_q\) and \(f_d\). Then, the Jensen–Shannon divergence \(JS(f_q, f_d)\) for estimating the similarity between the two latent topic-based distributions \(f_q\) and \(f_d\) is formally defined as follows:

\[ JS(f_q, f_d) = \frac{1}{2}\Big[KL\Big(f_q, \frac{f_q + f_d}{2}\Big) + KL\Big(f_d, \frac{f_q + f_d}{2}\Big)\Big] \tag{47} \]

where \(KL(f_q, f_d)\) is expressed as:

\[ KL(f_q, f_d) = \sum_{i=1}^{K} f_{qi}\log\Big(\frac{f_{qi}}{f_{di}}\Big) \tag{48} \]

where \(f_{qi}\) and \(f_{di}\), respectively, denote the i-th bin of the feature vectors \(f_q\) and \(f_d\).

6 Performance Evaluation and Discussion

The experimental validation of the formulated model is presented in this section. Firstly, a short description of the datasets used for evaluation is provided. Then, the quantitative evaluation of the proposed HDLA model in terms of its generalization ability is carried out. Finally, the retrieval efficiency of the latent topic-based image representation obtained with the proposed HDLA model is compared with state-of-the-art approaches.

6.1 Datasets Used

In the past, a number of benchmark datasets having ground truth images for a set of predefined queries have been introduced for evaluating different CBIR frameworks. Among them, six image collections with contrasting characteristics are selected for use in our retrieval experiments, and this section provides a detailed description of all these image collections.

INRIA Holiday dataset [30]. It involves 1491 high-resolution images of various places around the world. Images in this collection have a resolution of either 570 × 760 or 1020 × 760, and the collection mainly includes natural scene types. Among them, 500 images are reserved as queries, and there exist predefined retrieval lists for each of the queries.

Scene-15 dataset [31]. There are 4485 images in this collection, grouped into 15 concept categories. Each category contains between 210 and 410 images, and all of them have a fixed resolution equal to 250 × 300 pixels. Most of the images in the Scene-15 collection have distinguishing background and foreground context. Therefore, this image collection serves as a good choice for evaluating context-aware semantic image modeling schemes for the task of CBIR.

Oxford dataset [32]. This benchmark dataset comprises 5062 building images located at 11 different landmarks of Oxford city, and it is difficult to distinguish similar building facades from one another.
All images in the collection have a fixed resolution of 1020 × 760. The ground truth includes five images from each of the 11 landmarks and their corresponding search results. That is, 55 queries are available to evaluate the effectiveness of any retrieval system.

GHIM-10K dataset [33]: There are 10,000 images in the GHIM-10K dataset, spread over 20 concept categories. Each category contains 500 color images in JPEG format with a resolution of 300 × 400 or 400 × 300. Images in the search result that belong to the same semantic category as the given query are judged as relevant. That is, a randomly selected image from any of these 20 concept classes can act as the query, and there are exactly 499 relevant images in the collection.

IAPR TC-12 dataset [34]: Another widely used image collection selected for retrieval evaluation is the IAPR TC-12 dataset. It involves 20,000 images collected from various locations around the globe, comprising different types of natural scene images. All images in this collection are in JPEG format with a fixed size of 360 × 480 pixels. An interesting property of this image collection is that many images have identical visual content but differ in background, lighting conditions and viewing position.

MIRFLICKR-40K dataset [35]: The final image collection selected for evaluation is the MIRFLICKR-40K dataset, which is a subset of the MIRFLICKR-1M collection. This dataset comprises 40,000 images, all of which have a fixed resolution of 720 × 480. The notable characteristic of this image collection is that it exhibits semantic diversity, with images belonging to multiple categories and varying in appearance. Thus, the MIRFLICKR-40K dataset allows an in-depth analysis of any image retrieval framework due to its moderate size and heterogeneous content.

An ideal topic modeling scheme should adequately model the given data samples and at the same time have the potential to yield semantically coherent latent topics.
Therefore, it is necessary to analyze these two aspects of the proposed model while judging its competence. To do so, two sets of experiments are carried out using the proposed model. The first one is the generalization test on unseen data samples, and the other one is the evaluation of reconstruction error for a standard handwritten image collection. In all these experiments, the performance of the proposed model is compared with the following baseline approaches: Over Replicated Softmax Model (ORSM) [21], Replicated Softmax Model (RSM) [6] and Rate Adapting Poisson model (RAP) [5].

The hardware platform for simulating the proposed HDLA model is an Intel Core i7-4570 machine equipped with a 3.4 GHz CPU and 16 GB of RAM. The HDLA model is coded in the MATLAB R2016b (9.1) environment under a Unix operating system. For all the experiments presented in this paper, the proposed HDLA model is trained for 200 epochs with a learning rate \(\eta = 0.2\). The visible and hidden layer biases are initialized with small random values, and the model weights are randomly chosen from positive values in the range [0, 1]. It is found that \(k = 1\) is sufficient for the contrastive divergence algorithm to generate good latent topic-based features.

Since topic models are trained in a completely unsupervised fashion, it is difficult to evaluate the competence of one model over another. In practice, the performance of topic models is evaluated using their generalization ability on unseen data samples. More specifically, estimating the likelihood of a held-out data set provides a clear, interpretable metric for evaluating the performance of topic models relative to other existing models. The log-likelihood and the perplexity scores are the commonly used metrics to quantify the generalization ability of topic models. Let \(v_{test}\) denote the BoVW-based representation of an input image; the HDLA model then assigns a probability \(p(v_{test}) = \sum_{h} P(v_{test}, h)\) to the visible vector \(v_{test}\).
However, in practice, this probability is computationally intractable because the sum runs over an exponential number of terms. Therefore, we rely on sampling to compute the log-likelihood values as follows:

\[ \log \langle p(v_{test}) \rangle = \log \left\langle \frac{1}{n} \sum_{t=1}^{n} p(v_{test} \mid h^{(t)}) \right\rangle \tag{49} \]

where \(\{h^{(1)}, h^{(2)}, \ldots, h^{(n)}\}\) is a set of \(n\) samples drawn from \(P(v_{test}, h)\) by means of Gibbs sampling. Then, the average test perplexity value is computed as:

\[ \text{perplexity}(\mathcal{J}_{test}) = \exp\left( -\frac{1}{|D|} \sum_{i=1}^{|D|} \frac{1}{N_i} \log p(v_{test}^{(i)}) \right) \tag{50} \]

where \(\mathcal{J}_{test}\) is the given collection of test images, \(|D|\) is the number of images in the collection \(\mathcal{J}_{test}\), and \(N_i\) and \(v_{test}^{(i)}\), respectively, denote the number of interest points and the visual word-count vector for the i-th image in the collection \(\mathcal{J}_{test}\). From this definition, one can see that a lower perplexity score indicates better generalization performance.

We conducted log-likelihood and perplexity analysis on all six data sets considered for evaluation. An HDLA model with three hidden layers (i.e., \(L = 3\)) is used in this experiment. The visible layer of the proposed model accepts the BoVW-based representation of input images and then maps the input to the latent topic space. The log-likelihood and perplexity values are calculated by running the Gibbs sampler three times, each with 1000 iterations, and then taking the average of these three scores. Tenfold cross-validation is performed on all six datasets considered for evaluation. That is, the images in each dataset are grouped into ten folds of approximately equal size. Special care has been taken to ensure that there is no overlap between the images belonging to different folds. Then, in each run of the experiment, nine folds are used for model training, and the remaining fold is used for testing the model. For different sizes (\(K\)) of the visual dictionary, the total log-likelihood values obtained for each of the compared models by varying the number of latent topics (\(T_{L=3}\)) are summarized in Table 2.
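As an illustration of Eqs. (49) and (50), the following Python sketch computes the sampled log-likelihood estimate and the average test perplexity from precomputed values. It is a minimal sketch under stated assumptions: the quantities \(\log p(v_{test} \mid h^{(t)})\) are taken as already produced by the Gibbs sampler, the exponent of the perplexity carries the conventional negative sign, and the function names `log_mean_exp` and `average_perplexity` are illustrative and not part of the original MATLAB implementation.

```python
import math


def log_mean_exp(log_values):
    """Log of the mean of exp(log_values), cf. Eq. (49).

    log_values holds log p(v_test | h^(t)) for the n sampled hidden states.
    The maximum is subtracted first so that exp() does not underflow.
    """
    if not log_values:
        raise ValueError("at least one sample is required")
    m = max(log_values)
    mean_scaled = sum(math.exp(v - m) for v in log_values) / len(log_values)
    return m + math.log(mean_scaled)


def average_perplexity(log_likelihoods, word_counts):
    """Average test perplexity, cf. Eq. (50).

    log_likelihoods[i] is log p(v_test^(i)) for the i-th test image and
    word_counts[i] is N_i, its number of interest points (visual words).
    Assumption: the exponent is negated, as is conventional for perplexity.
    """
    if not log_likelihoods or len(log_likelihoods) != len(word_counts):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(ll / n for ll, n in zip(log_likelihoods, word_counts))
    return math.exp(-total / len(log_likelihoods))
```

As a sanity check on the definition, a model that spreads its probability uniformly over a dictionary of 8 visual words yields \(\log p(v) = N \log(1/8)\) for an image with \(N\) words, and the sketch then returns a perplexity of 8, the dictionary size.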
From the results in Table 2, it can be concluded that the proposed model outperforms other existing models in terms of its generalization performance.

Next, the convergence property of the hybrid deep learning model is analyzed. To this end, a series of experiments has been carried out to see whether the proposed topic modeling scheme converges at a rate faster than state-of-the-art approaches. Figure 8 depicts the perplexity values of individual models as a function of the number of iterations when applied to all six image data sets. In all these experiments, the values of \(K\) and \(T_{L=3}\) are selected so as to give better generalization performance. The obtained results reveal that the perplexity values of the formulated model consistently decrease in successive iterations and that it achieves a faster rate of convergence as compared to other models.

In conclusion, the effectiveness of a given topic modeling scheme depends on its generalization ability, which in turn is directly related to the number of training iterations. There is always an upper limit beyond which an increase in the number of iterations has no effect on the model's generalization power. It is evident from the above results that the generalization power of the existing models is not up to the mark even for a substantially large number of training iterations. However, the proposed HDLA model outperforms the widely used baseline models in terms of its generalization ability and convergence rate. That is, the HDLA model attains better generalization power within a smaller number of training iterations. Therefore, the HDLA-based formulation is capable of yielding a semantic-based image representation having more discriminative power.

6.2.3 Reconstruction Performance

To further evaluate the effectiveness of the obtained latent topic-based representation, the hybrid deep learning architecture is applied to model images of handwritten digits.
The performance of the proposed model is then measured in terms of reconstruction error, which is defined as the average pixel difference between the original and reconstructed images. The Reconstruction Error (RE) for a given image \(\mathcal{J}\) is calculated as follows:

\[ RE(\mathcal{J}) = \frac{1}{d} \sum_{j=1}^{d} \left( \mathcal{J}_j \neq \tilde{\mathcal{J}}_j \right) \tag{51} \]

where \(d\) is the dimensionality of the vectorized version of each input image \(\mathcal{J}\) and \(\tilde{\mathcal{J}}\) is the reconstruction of \(\mathcal{J}\) by the learned model. The MNIST handwritten digit dataset [36] is used as the benchmark for experimental evaluation. This dataset contains 60,000 training and 10,000 test images covering the 10 digits (0 to 9). Each handwritten digit is a 28 × 28-pixel gray level image. Hence, the visible layer of the proposed model contains 28 × 28 = 784 nodes. Initially, the pixel values (0 to 255) of all input images are mapped to 0 or 1. For this, a threshold value of 30 is selected: pixel values greater than or equal to 30 are set to 1, while values less than 30 are set to 0. A given image in its vectorized binary form is reconstructed by sampling the topmost hidden layer vector from the latent model under evaluation, followed by sampling the visible vector based on the generated hidden vector.
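The binarization step and the reconstruction error of Eq. (51) can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes images are given as flat lists of integer pixel values, it treats the term \((\mathcal{J}_j \neq \tilde{\mathcal{J}}_j)\) as an indicator that counts 1 for each differing pixel, and the function names `binarize` and `reconstruction_error` are hypothetical. The sampling of the hidden and visible vectors from the trained model is not shown.

```python
def binarize(pixels, threshold=30):
    """Map gray levels (0 to 255) to {0, 1}: values >= threshold become 1."""
    return [1 if p >= threshold else 0 for p in pixels]


def reconstruction_error(original, reconstructed):
    """Fraction of positions at which two binary vectors differ, cf. Eq. (51)."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(original, reconstructed) if a != b)
    return mismatches / len(original)


# Toy example with a 4-pixel "image"; a real MNIST vector would have 784 entries.
original = binarize([0, 29, 30, 255])
# Hypothetical reconstruction that flips two of the four pixels.
reconstructed = [0, 1, 1, 0]
error = reconstruction_error(original, reconstructed)
```

Multiplying the returned fraction by 100 gives the error as a percentage, which is the unit used when reconstruction errors are reported in tabular form.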
The resulting visible vector is multiplied by 255 and is then binarized by the same procedure described above.

[Table 2: log-likelihood values of RAP, RSM, ORSM and HDLA (Proposed) for different dictionary sizes; table entries not recoverable.]

[Fig. 8: perplexity versus number of iterations for RAP, RSM, ORSM and HDLA (Proposed); recoverable panel title: (a) Holiday dataset.]

Table 3 Evaluation of the reconstruction performance of the proposed HDLA model. Model configuration: number of RBM units (3, 4) and number of training samples (30,000, 60,000); reported measure: reconstruction error (%) for ORSM and HDLA. [Table entries not recoverable.]

To deal with binary inputs, the RSM unit in the first layer of the proposed HDLA model shown in Fig. 7 is replaced with an RBM unit. In our experiments, different configurations of the proposed HDLA model are trained for the purpose of reconstructing MNIST handwritten digit images. The performance of the proposed HDLA model is then evaluated in comparison with the Over Replicated Softmax Model (ORSM). Instead of directly using the actual training and test sets, the entire data set is pooled into ten equal-sized subsets.
One of these subsets is then used for model evaluation, and the remaining nine subsets are used for model training. This process is repeated ten times, rotating through all the subsets, which leads to tenfold cross-validation results. The obtained values are summarized in Table 3. From these results, it is evident that HDLA is a good generative model and that it can significantly reduce the reconstruction error as compared to the ORSM-based formulation. Another factor to take into account is the impact of the number of training samples on the performance of HDLA and ORSM. Therefore, experiments are conducted by varying the number of training samples for each configuration of HDLA and ORSM. In all such cases, the proposed HDLA framework exhibits better reconstruction performance and is less sensitive to the size of the training set as compared to ORSM.

6.3 Evaluation of HDLA-Based Image Search

This section evaluates the retrieval effectiveness of the proposed HDLA model in comparison with other latent topic-based approaches. The following subsections describe the performance measures employed to judge the retrieval results, the procedure used to select appropriate values for the model parameters of HDLA for effective image retrieval, and the results of the retrieval experiments carried out on various datasets.

6.3.1 Evaluation Metrics

The primary objective of a typical CBIR system is to generate a ranked list of the top \(k\) images from the given dataset in response to a submitted query. The rank of an image is determined by its relevance to the query at hand. To be able to compare various image retrieval models, a set of performance measures first has to be identified. When the ground truth of the data set is available, the system's performance is generally measured in terms of quantitative metrics such as precision and recall.
The precision of a retrieval system measures the percentage of relevant images in the ranked retrieval list, and the recall denotes the percentage of relevant images retrieved by the system. These two metrics are defined as follows:

\[ \text{Precision} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of images retrieved}} \tag{52} \]

\[ \text{Recall} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of relevant images in the set}} \tag{53} \]

Precision and recall do not take into account the order in which relevant images appear in the ranked retrieval list. When two retrieval systems have the same precision and recall values, the system that ranks relevant images higher is preferred. To address this issue, measures like precision at rank position \(k\) (p@k) and R-Precision are introduced. p@k is the value of precision calculated using the first \(k\) images in the retrieval list. Similarly, R-Precision for a given query is defined as the precision after retrieving \(R\) images from the image database and is expressed as:

\[ R\text{-Precision} = \frac{1}{R} \sum_{j=1}^{R} Rel(j) \tag{54} \]

where \(R\) is the total number of relevant images in the database for the given query and \(Rel(j)\) is an indicator function which returns the value 1 when the image present at the j-th location of the retrieval list is relevant with respect to the given query.

Moreover, precision can be expressed as a function of recall. The interpolated precision-recall graph plots precision as a function of recall and can be used to assess the overall performance of the retrieval framework. The interpolated precision \(P_{int}\) at a recall level \(r_i\) is calculated as the largest observed precision for any recall value \(r\) between \(r_i\) and \(r_{i+1}\):

\[ P_{int}(r_i) = \max_{r_i \le r \le r_{i+1}} \text{Precision}(r) \tag{55} \]

An alternative single-valued evaluation metric is the mean average precision (mAP).
For a set of \(m\) query images, the mean average precision is defined as:

\[ \text{mAP} = \frac{1}{m} \sum_{q=1}^{m} AP(q) \tag{56} \]

where \(AP(q)\) is the average precision for a given query \(q\) and is defined as the ratio of the sum of precision values at the rank positions where a relevant image is found in the retrieval result to the total number of relevant images in the database.

One last metric is the Average Retrieval Rate (ARR), which is defined as:

\[ \text{ARR} = \frac{1}{N_Q} \sum_{q=1}^{N_Q} RR(q) \tag{57} \]

where \(N_Q\) represents the number of queries used for evaluating the retrieval system. \(RR(q)\) is the retrieval rate for a single query \(q\) and is calculated as:

\[ RR(q) = \frac{N_R(\alpha, q)}{N_G(q)} \tag{58} \]

where \(N_G(q)\) is the number of ground truth images of a query \(q\) and \(N_R(\alpha, q)\) indicates the number of relevant images found in the first \(\alpha N_G(q)\) images. The value of \(\alpha\) should be greater than or equal to 1. Larger values of \(\alpha\) discriminate less between very good retrieval results and those that are not so good. Hence, in this work the value of \(\alpha\) is fixed at 1.5.

Another important metric to evaluate the quality of a retrieval result is the normalized Discounted Cumulative Gain (nDCG). The intuition underlying nDCG is that an end user is mainly interested in the top positions of the retrieval list and is less likely to explore the lower-ranked images. To incorporate this notion in the evaluation metric, nDCG follows a graded correctness score. The correctness \(cr_i\) of an image \(i\) in the retrieval list varies within the range 0–3 according to user judgement, where 0 corresponds to irrelevant images and 3 corresponds to the most relevant image.
Based on the correctness score, the usefulness or gain of each image with respect to its position \(p\) in the retrieval list is estimated and is then accumulated to compute the nDCG value as follows:

\[ nDCG_p = \frac{DCG_p}{IDCG_p} \tag{59} \]

where \(DCG_p\) is the discounted cumulative gain and \(IDCG_p\) is the ideal DCG value at rank list position \(p\), which are, respectively, defined as follows:

\[ DCG_p = \sum_{i=1}^{p} \frac{2^{cr_i} - 1}{\log_2(i + 1)} \tag{60} \]

\[ IDCG_p = \sum_{i=1}^{|RN|} \frac{2^{cr_i} - 1}{\log_2(i + 1)} \tag{61} \]

where \(|RN|\) is the number of images in the retrieval list \(R_q\) sorted in descending order of their correctness score up to rank position \(p\). The logarithmic factor in the denominator is a penalty term by which the correctness value of highly relevant images appearing at the bottom of the search result is discounted. Finally, the nDCG values of all the queries are averaged to get the overall performance of the retrieval system.

6.3.2 Parameter Selection

In the context of image retrieval, it is important to select appropriate values for the parameters of the HDLA model. More specifically, parameters such as the visual dictionary size (\(K\)), the number of hidden layers and the number of nodes in each hidden layer need to be tuned for good retrieval performance. For each image collection, this is done by calculating the average retrieval rates for each query set while varying the visual dictionary size and the number of nodes in each hidden layer of HDLA. Figure 9 depicts the average retrieval rates obtained on the different image collections while changing the number of hidden layer units along with the visual dictionary size. Reasonable values for the model parameters can then be fixed by analyzing the results shown in Fig. 9. Once proper estimates of these parameters have been obtained, they can be frozen and used for subsequent retrieval experiments. To avoid computational bottlenecks, an HDLA model with three layers of hidden units is considered in our retrieval experiments.
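The selection procedure described above amounts to computing the ARR of Eqs. (57) and (58) for every candidate configuration and keeping the best one. The following Python sketch illustrates this under stated assumptions: each query is represented by a binary relevance list in ranked order together with its number of ground truth images, the cutoff \(\alpha N_G(q)\) is truncated to an integer, and the names `retrieval_rate`, `average_retrieval_rate` and `select_configuration`, as well as the dictionary layout keyed by (dictionary size, hidden nodes), are hypothetical.

```python
def retrieval_rate(relevance, num_ground_truth, alpha=1.5):
    """RR(q) of Eq. (58): relevant images within the first alpha * N_G(q) results.

    relevance is a list of 0/1 flags in ranked order.
    Assumption: a fractional cutoff is truncated to an integer.
    """
    cutoff = int(alpha * num_ground_truth)
    return sum(relevance[:cutoff]) / num_ground_truth


def average_retrieval_rate(queries, alpha=1.5):
    """ARR of Eq. (57), averaged over (relevance, num_ground_truth) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    rates = [retrieval_rate(rel, n_gt, alpha) for rel, n_gt in queries]
    return sum(rates) / len(rates)


def select_configuration(results, alpha=1.5):
    """Return the configuration key with the highest ARR.

    results maps a hypothetical (dictionary_size, hidden_nodes) tuple to the
    list of per-query (relevance, num_ground_truth) pairs obtained with it.
    """
    return max(results, key=lambda cfg: average_retrieval_rate(results[cfg], alpha))


# Toy example with two configurations and one query each.
results = {
    (500, 100): [([0, 1, 0, 0], 2)],
    (800, 200): [([1, 0, 1, 1, 0, 0], 2)],
}
best = select_configuration(results)
```

In this toy example the second configuration is selected because both ground truth images appear within the first three results, whereas the first configuration finds only one of them.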
It is empirically found that an HDLA model with three layers of hidden units is good enough to generate a latent topic-based image representation having more discriminative power and retrieval accuracy than the existing topic modeling schemes. The next subsection summarizes the comparative evaluation of various image retrieval experiments.

[Fig. 9: x-axis: dictionary size (K); recoverable panel titles: (a) Holiday dataset, (c) Oxford dataset, (e) IAPR TC-12 dataset, (f) MIRFLICKR-40K dataset.]

6.3.3 Retrieval Results and Discussion

This section verifies the retrieval efficiency of the proposed scheme in comparison with state-of-the-art models. In this regard, the following retrieval frameworks have been selected for comparison: Over Replicated Softmax Model (ORSM) [21], Replicated Softmax Model (RSM) [6], Rate Adapting Poisson model (RAP) [5], Pachinko Allocation Model (PAM) [15] and Latent Dirichlet Allocation (LDA) [2]. The retrieval effectiveness of the proposed HDLA model is initially evaluated on the basis of mAP, average R-Precision and \(nDCG_{p=10}\) values. The comparison of the proposed model and the existing methods is provided in Table 4. On average, the HDLA model achieves a 6% improvement in the values of mAP, average R-Precision and \(nDCG_{p=10}\) as compared to the best performing approach in the literature. From these statistics, it is evident that the proposed HDLA model is promising and that it gives better retrieval results compared to state-of-the-art methods. Figure 10 shows the 11-point interpolated average precision values obtained for the proposed HDLA-based image search in comparison with other latent topic-based retrieval strategies.
From the results in Fig. 10, it can be concluded that the precision achieved with the proposed HDLA-based image representation is higher than that of the existing models across all values of recall for all image collections selected for evaluation.

Fig. 10 Evaluation of the proposed HDLA-based image retrieval framework in comparison with state-of-the-art approaches (ORSM, RSM, RAP, PAM, LDA) based on 11-point interpolated average precision. a Holiday dataset. b Scene-15 dataset. c Oxford dataset. d GHIM-10K dataset. e IAPR TC-12 dataset. f MIRFLICKR-40K dataset

To further validate the effectiveness of the proposed HDLA model, its performance is compared with other existing models in terms of the average precision values at selected rank thresholds of 10, 20 and 30 (i.e., p@10, p@20 and p@30). The average precision values of the retrieval experiments carried out on all the benchmark datasets are presented in Table 5. The performance figures of the proposed HDLA model are shown in boldface letters. When an end user is interested in viewing only the top 10, 20 and 30 results returned by the retrieval model, a 6% improvement on average is achieved with the proposed HDLA-based formulation.

To conclude, the hybrid deep learning architecture proposed in this paper yields a compact but discriminative image representation well suited for the retrieval operation. All the retrieval experiments substantiate the ability of the proposed HDLA model to discover latent topics by grouping semantically similar visual words to characterize images at a much higher level of abstraction. The abovementioned experimental results validate the potential of the HDLA-based formulation to bridge the semantic gap in image understanding and retrieval.
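To make the retrieval procedure evaluated in this section concrete, the following Python sketch ranks database images by the JS divergence of Sect. 5.2 and scores a ranked list with p@k, average precision and nDCG as defined in Sect. 6.3.1. It is a minimal sketch and not the implementation used in the experiments: it assumes that each feature vector is non-negative and normalized to sum to one, it skips zero-valued bins in the KL term, it computes the ideal DCG by sorting the supplied gains, and all function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) as in Eq. (48); bins with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p, q) as in Eq. (47), using the midpoint distribution (p + q) / 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, mid) + kl_divergence(q, mid))


def rank_database(query, database):
    """Indices of database vectors, most similar (smallest JS) first."""
    return sorted(range(len(database)),
                  key=lambda i: js_divergence(query, database[i]))


def precision_at_k(relevance, k):
    """p@k: fraction of relevant images among the first k results."""
    return sum(relevance[:k]) / k


def average_precision(relevance, num_relevant):
    """AP: precision summed at each relevant rank, divided by num_relevant."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_p(gains, p):
    """nDCG at position p from graded correctness scores, cf. Eqs. (59)-(61)."""
    def dcg(scores):
        return sum((2 ** g - 1) / math.log2(i + 1)
                   for i, g in enumerate(scores[:p], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy example with 3-dimensional latent topic vectors.
query = [0.7, 0.2, 0.1]
database = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
ranking = rank_database(query, database)
```

Averaging `average_precision` over all queries gives the mAP of Eq. (56). The midpoint distribution is strictly positive wherever either input is, so the JS divergence stays finite even when individual topic bins are zero, which is one practical reason to prefer it over the plain KL divergence for sparse representations.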
7 Conclusion

In this paper, a new class of topic modeling scheme called hybrid deep learning architecture is proposed for semantic image modeling and retrieval. The proposed architecture is a composite of a Replicated Softmax Model and Restricted Boltzmann Machines with a nonnegativity restriction on the network weights and an \(\ell_1\)-sparseness constraint on the hidden layer activations. As part of image modeling, the formulated architecture infers, in a completely unsupervised fashion, a hierarchical nonlinear mapping function that projects the original BoVW-based representation onto a latent topic-based semantic concept space. Thus, the hybrid deep learning architecture can capture semantic correlation among visual words and consequently minimizes the semantic loss associated with BoVW-based image retrieval. Based on the experimental evaluations, it can be concluded that the image representation yielded by the proposed HDLA model significantly improves the retrieval performance as compared to state-of-the-art latent topic-based image retrieval systems.

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no competing interests.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix A Gradient of the Regularization Terms for CRSM

The gradient of the quadratic barrier function with respect to the network weights of CRSM is computed as follows:

\[ \frac{\partial}{\partial w_{ij}} f(w_{ij}) = \begin{cases} 2 w_{ij}, & w_{ij} < 0 \\ 0, & w_{ij} \ge 0 \end{cases} \]

The gradient is therefore nonzero only for negative weights; in all other cases its value is zero.
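The piecewise gradient of the quadratic barrier function can be illustrated with a short Python sketch. It assumes that the barrier itself is \(f(w) = w^2\) for \(w < 0\) and \(0\) otherwise, which is the form consistent with the gradient \(2w\) stated above; any regularization coefficient that multiplies the barrier during training is left out, and the function names are hypothetical.

```python
def barrier(w):
    """Assumed quadratic barrier: w^2 for negative weights, 0 otherwise."""
    return w * w if w < 0 else 0.0


def barrier_gradient(w):
    """Gradient of the barrier: 2w for negative weights, 0 otherwise."""
    return 2.0 * w if w < 0 else 0.0


def barrier_gradient_matrix(weights):
    """Apply the gradient element-wise to a weight matrix (list of rows)."""
    return [[barrier_gradient(w) for w in row] for row in weights]


# Only the negative entries of the toy weight matrix receive a gradient.
weights = [[0.4, -0.5], [-0.1, 0.0]]
gradients = barrier_gradient_matrix(weights)
```

Because the gradient vanishes for non-negative weights, the penalty only acts on weights that violate the nonnegativity restriction and pushes them back toward zero.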
As the definition of the quadratic barrier function is free from the bias terms, its gradient with respect to \(b\) and \(a^{(1)}\) will be zero. Similarly, the gradient of the sparse regularization term with respect to the model parameters of CRSM is computed as follows:

[Eqs. (63) and (64), the partial derivatives with respect to \(w_{ij}\) and \(a_j\), are not recoverable from the source. The surviving fragments involve the activation probability \(p(h_j^{(1)} = 1 \mid U_s)\), its square \(p^2(h_j^{(1)} = 1 \mid U_s)\), and a term of the form \(1 + \exp(\cdot)\) over \(\sum_{i=1}^{K} w_{ij} u_i\) and \(N a_j\).]

Since the activation probability of the hidden units in CRSM is independent of the visible layer bias term \(b\), the gradient of the sparse regularization part with respect to \(b\) will be zero.

Appendix B Gradient of the Regularization Terms for CRBM-\(\ell\)

The gradient of the quadratic barrier function with respect to the network weights of CRBM-\(\ell\) is computed as follows:

\[ \frac{\partial}{\partial w_{ij}^{(\ell)}} f(w_{ij}^{(\ell)}) = \begin{cases} 2 w_{ij}^{(\ell)}, & w_{ij}^{(\ell)} < 0 \\ 0, & w_{ij}^{(\ell)} \ge 0 \end{cases} \tag{65} \]

The gradient of the quadratic barrier function with respect to the parameter \(a^{(\ell)}\) is zero because the definition of the nonnegativity constraint does not involve any bias terms. Similarly, the gradient of the sparse regularization term \(f(p(h^{(\ell+1)} \mid h_s^{(\ell)}))\) with respect to the model parameters of CRBM-\(\ell\) is computed as:

[Eqs. (66) and (67), the partial derivatives with respect to \(w_{ij}^{(\ell)}\) and \(a_j^{(\ell)}\), are not recoverable from the source. The surviving fragments involve the activation probability \(p(h_j^{(\ell+1)} = 1 \mid h_s^{(\ell)})\), its square, and a term of the form \(1 + \exp(\cdot)\) over \(\sum_{i=1}^{T_\ell} w_{ij}^{(\ell)} h_i^{(\ell)}\) and \(a_j^{(\ell)}\).]

References

1. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1):177
2. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(January):993
3. Blei DM, Lafferty JD (2005) Correlated topic models. In: Proceedings of the 18th international conference on neural information processing systems, MIT Press, Cambridge, MA, pp 147-154
4.
Boulemden A, Tlili Y (2012) Image indexing and retrieval with pachinko allocation model: application on local and global features. In: Proceedings of the 12th pacific rim conference on knowledge management and acquisition for intelligent systems. Springer, Berlin, pp 140-146
5. Gehler PV, Holub AD, Welling M (2006) The rate adapting poisson model for information retrieval and object recognition. In: Proceedings of the 23rd international conference on machine learning. ACM, New York, pp 337-344
6. Salakhutdinov R, Hinton G (2009) Replicated softmax: an undirected topic model. In: Proceedings of the 22nd international conference on neural information processing systems. Curran Associates Inc., USA, pp 1607-1614
7. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391
8. Pecenovic Z (1997) Intelligent image retrieval using latent semantic indexing. Master's thesis, Swiss Federal Institute of Technology
9. Zhang R, Zhang Z (2007) Effective image retrieval based on hidden concept discovery in image database. IEEE Trans Image Process 16(2):562
10. Lienhart R, Romberg S, Hörster E (2009) Multilayer pLSA for multimodal image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, New York
11. Li P, Cheng J, Li Z, Lu H (2011) Correlated PLSA for image clustering. In: Advances in multimedia modeling, pp 307-316
12. Chiang CC, Wu JW, Lee GC (2012) Probabilistic semantic component descriptor. Multimed Tools Appl 59(2):629
13. Hörster E, Lienhart R, Slaney M (2007) In: Proceedings of the 6th ACM international conference on image and video retrieval. ACM, New York, pp 17-24
14. Greif T, Hörster E, Lienhart R (2008) Correlated topic models for image retrieval. University of Augsburg, Germany, July, Tech. rep
15. Li W, McCallum A (2006) Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd international conference on machine learning, ACM, New York, pp 577-584
16. Andrieu C, De Freitas N, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1-2):5
17. Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: Proceedings of the eighteenth conference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc., pp 352-359
18. Casella G, George EI (1992) Explaining the Gibbs sampler. Am Stat 46(3):167
19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504
20. Hinton G (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):926
21. Srivastava N, Salakhutdinov R, Hinton G (2013) Modeling documents with a deep boltzmann machine. In: Proceedings of the twenty-ninth conference on uncertainty in artificial intelligence. AUAI Press, Arlington, Virginia, pp 616-624
22. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14(4):481
23. Salakhutdinov R, Hinton G (2009) Deep boltzmann machines. In: Proceedings of the twelfth international conference on artificial intelligence and statistics, Clearwater Beach, Florida, pp 448-455
24. Brooks S, Gelman A, Jones GL, Meng XL (2011) Handbook of markov chain monte carlo. CRC Press, Boca Raton
25. Hinton GE, Salakhutdinov RR (2012) A better way to pretrain deep boltzmann machines. In: Proceedings of the 26th annual conference on neural information processing systems. Lake Tahoe, Nevada, pp 2447-2455
26. Bruna J, Mallat S (2013) Invariant scattering convolution networks. IEEE Trans Pattern Anal Machine Intelligence 35(8):1872
27. Lee DD, Seung HS (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401(6755):788
28. Poggio T, Girosi F (1998) A sparse representation for function approximation. Neural Comput 10(6):1445
29. Nguyen TD, Tran T, Phung DQ, Venkatesh S (2013) Learning parts-based representations with nonnegative restricted boltzmann machine. In: Proceedings of the Asian conference on machine learning. ACT, Canberra, pp 133-148
30. Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. In: Proceedings of the 10th European conference on computer vision: Part I. Springer, Berlin, pp 304-317
31. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition. Vol 2, IEEE Computer Society, Washington, DC, pp 2169-2178
32. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, New York, pp 1-8
33. Liu GH, Yang JY, Li Z (2015) Content-based image retrieval using computational visual attention model. Pattern Recogn 48(8):2554
34. Grubinger M, Clough P, Müller H, Deselaers T (2006) The iapr tc-12 benchmark: a new evaluation resource for visual information systems. In: Proceedings of international conference on language resources and evaluation. vol 5, ELRA, p 10
35. Huiskes MJ, Thomee B, Lew MS (2010) New Trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In: Proceedings of international conference on multimedia information retrieval. ACM, New York, pp 527-536
36. Deng L (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process Mag 29(6):141



K. S. Arun, V. K. Govindan. A Hybrid Deep Learning Architecture for Latent Topic-based Image Retrieval, Data Science and Engineering, 2018, 1-30, DOI: 10.1007/s41019-018-0063-7