State-dependent swap strategies and automatic reduction of number of temperatures in adaptive parallel tempering algorithm

Stat Comput
DOI 10.1007/s11222-015-9579-0

Mateusz Krzysztof Łącki¹ · Błażej Miasojedow²

Received: 12 June 2014 / Accepted: 15 May 2015
© The Author(s) 2015. This article is published with open access at Springerlink.com

¹ Institute of Informatics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
² Institute of Applied Mathematics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

Electronic supplementary material: The online version of this article (doi:10.1007/s11222-015-9579-0) contains supplementary material, which is available to authorized users.

Abstract In this paper we present extensions to the original adaptive Parallel Tempering algorithm. Two different approaches are presented. In the first one we introduce state-dependent strategies using current information to perform a swap step. It encompasses a wide family of potential moves including the standard one and Equi-Energy type move, without any loss in tractability. In the second one, we introduce online trimming of the number of temperatures. Numerical experiments demonstrate the effectiveness of the proposed method.

Keywords Parallel tempering · Adaptive MCMC · Swapping strategies · Equi-Energy sampler

1 Introduction

Markov chain Monte Carlo (MCMC) is a generic method to approximate an integral of the form

$$I := \int_{\mathbb{R}^d} f(y)\,\pi(y)\,dy,$$

where $\pi$ is a probability density function, which can be evaluated point-wise up to a normalising constant. Such an integral occurs frequently when computing Bayesian posterior expectations (Robert and Casella 1999; Gilks et al. 1998). The random walk Metropolis algorithm (Metropolis et al. 1953) often works well, provided the target density $\pi$ is, roughly speaking, sufficiently close to unimodal. The efficiency of the Metropolis algorithm can be optimised by a suitable choice of proposal distribution.
The proposal, in turn, can be chosen automatically by several adaptive MCMC algorithms; see Haario et al. (2001), Atchadé and Rosenthal (2005), Roberts and Rosenthal (2009), Andrieu and Thoms (2008) and references therein.

When $\pi$ has multiple well-separated modes, random walk-based methods tend to get stuck in a single mode for long periods of time. This can lead to false convergence and severely erroneous results. Using a tailored Metropolis-Hastings algorithm can help but, in many cases, finding a good proposal distribution is not easy. Tempering of $\pi$, that is, considering auxiliary distributions with density proportional to $\pi^{\beta}$ with $\beta \in (0, 1)$, often provides better mixing between modes (Swendsen and Wang 1986; Marinari and Parisi 1992; Hansmann 1997; Woodard et al. 2009; Neal 1996). We focus here particularly on the parallel tempering algorithm, which is also known as replica exchange Monte Carlo and Metropolis-coupled Markov chain Monte Carlo.

The tempering approach is particularly tempting in settings where $\pi$ admits a physical interpretation and there is good intuition for how to choose the temperature schedule for the algorithm. In general, choosing the temperature schedule is a nontrivial task, but there are generic guidelines for temperature selection based on both empirical findings and theoretical analysis (Kofke 2002; Kone and Kofke 2005; Atchadé et al. 2011; Roberts and Rosenthal 2012). These theoretical findings were used to derive an adaptive version of the Parallel Tempering algorithm (Miasojedow et al. 2013a). Another approach to temperature tuning can be found in Behrens et al. (2012). It offers a different criterion for choosing the temperature schedule and was developed for the Tempered Transitions algorithm (Neal 1996).

In the present paper we consider the adaptive version of the Parallel Tempering algorithm. The adaptation consists in introducing state-dependent swaps between differently tempered random walks.
We study the impact of different distributions over the potential swap steps and call them Strategies. Our choice of strategies is driven by solutions already known in the literature (Kou et al. 2006) and used within the Parallel Tempering algorithm by Baragatti et al. (2013). The novelty of our approach stems from an alternative implementation of Equi-Energy moves that renders the algorithm parameter-free, i.e. the user no longer needs to provide precise Energy Rings. We also investigate different modifications of this new approach.

In addition, we propose an automated method for reducing the actual number of considered temperatures, in the spirit of Miasojedow et al. (2013a). The temperature adaptation scheme depends on the parameters of the adaptive random walks applied in the parallelised Metropolis-Hastings stage of the algorithm, in the case where the state space is the usual $\mathbb{R}^d$. We also show that the proposed algorithm satisfies the Law of Large Numbers, in the same setting as in Miasojedow et al. (2013a).

2 Definition and notations

Our basic object of interest is the density $\pi : \Omega \to \mathbb{R}_+$, where $\Omega = \mathbb{R}^d$. We assume that we can evaluate point-wise a function proportional to $\pi$. The Parallel Tempering approach suggests constructing a Markov chain on the product space $\Omega^L$, where $L$ is the number of temperature levels. On that space a new density $\pi_{\boldsymbol{\beta}}$ is constructed by setting, for $x \in \Omega^L$,

$$\pi_{\boldsymbol{\beta}}(x) = \pi_{\boldsymbol{\beta}}(x_1, \dots, x_L) \propto \pi^{\beta_1}(x_1) \times \cdots \times \pi^{\beta_L}(x_L),$$

so that $x_i \in \Omega$ and $\boldsymbol{\beta} = (\beta_1, \beta_2, \dots, \beta_L)$ are the inverse temperatures, subject to $1 = \beta_1 > \beta_2 > \cdots > \beta_L$. The vector $T = (T_1, \dots, T_L)$, where $T_l = \beta_l^{-1}$, contains numbers known as temperatures and is itself referred to as the temperature scheme. The density $\pi_{\boldsymbol{\beta}}$ is known up to a proportionality factor and, since $\beta_1 = 1$, its marginal in the first coordinate is the original distribution $\pi$.
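As a rough illustration of the product density defined above, the following Python sketch evaluates its unnormalised logarithm, $\sum_l \beta_l \log \pi(x_l)$. The bimodal target `log_pi`, the helper `log_pi_beta`, the choice $d = 1$ and the three inverse temperatures are all hypothetical, chosen only for this example; none of them come from the paper.

```python
import math


def log_pi(x):
    """Hypothetical unnormalised log-target on R (d = 1 is an assumption):
    an equal mixture of N(-3, 1) and N(3, 1), evaluated via log-sum-exp."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_pi_beta(xs, betas):
    """Unnormalised log of the product density on Omega^L:
    the sum over levels l of beta_l * log pi(x_l)."""
    if len(xs) != len(betas):
        raise ValueError("need exactly one state per temperature level")
    return sum(beta * log_pi(x) for x, beta in zip(xs, betas))


# Illustrative inverse temperatures with 1 = beta_1 > beta_2 > beta_3.
betas = [1.0, 0.5, 0.25]
# Temperature scheme T_l = 1 / beta_l.
temperatures = [1.0 / beta for beta in betas]
# One state per coordinate chain.
xs = [3.0, 0.0, -3.0]
value = log_pi_beta(xs, betas)
```

With a single level and $\beta_1 = 1$, `log_pi_beta([x], [1.0])` reduces to `log_pi(x)`, which mirrors the remark that the first coordinate carries the original target.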
The Markov chain $X = \{X^{[k]}\}_{k \ge 0}$ targets $\pi_{\boldsymbol{\beta}}$ and can be thought of as $L$ coordinate chains, $X^{[k]} = (X_1^{[k]}, \dots, X_L^{[k]})$. The first coordinate chain will be referred to as the base (temperature) chain. The main idea behind Parallel Tempering is to interweave random walk steps with random swaps between chains. Each random swap exchanges the results of a random walk step between two coordinate chains. Chains corresponding to higher temperatures¹ should, in principle, be more volatile and travel between different modes more easily than chains linked to lower temperatures. Indeed, if $x$ is the last state visited by the $l$th chain and $y$ is a proposal drawn from a region where the density takes smaller values, $\pi(y) < \pi(x)$, then the probability of accepting such a proposal, which we call $\eta_l$, is higher on the more tempered chain²:

$$\eta_l(x, y) = 1 \wedge \left(\frac{\pi(y)}{\pi(x)}\right)^{\beta_l} > 1 \wedge \frac{\pi(y)}{\pi(x)} = \eta_1(x, y).$$

Therefore, the more exchanges of higher tempered chains with the base chain, the bigger the chance of getting out of a local probability cluster where a simple Markov chain would get stuck. The generation of Markov (...truncated)