Gradient-based Adaptive Markov Chain Monte Carlo

Michalis K. Titsias, DeepMind, London, UK, mtitsias
Petros Dellaportas, Department of Statistical Science, University College London, UK; Department of Statistics, Athens Univ. of Econ. and Business, Greece; and The Alan Turing Institute, UK

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. We define a maximum entropy regularised objective function, referred to as generalised speed measure, which can be robustly optimised over the parameters of the proposal distribution by applying stochastic gradient optimisation. An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework for learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.

1 Introduction

Markov chain Monte Carlo (MCMC) is a family of algorithms that provide a mechanism for generating dependent draws from arbitrarily complex distributions. The basic set up of an MCMC algorithm in any probabilistic (e.g. Bayesian) inference problem, with an intractable target density $\pi(x)$, is as follows. A discrete time Markov chain $\{X_t\}_{t \geq 0}$ with transition kernel $P_\theta$, appropriately chosen from a collection of $\pi$-invariant kernels $\{P_\theta(\cdot, \cdot)\}_{\theta \in \Theta}$, is generated and the ergodic averages $\hat{\mu}_t(F) = t^{-1} \sum_{i=0}^{t-1} F(X_i)$ are used as approximations to $E_\pi(F)$ for any real-valued function $F$. Although in principle this sampling setup is simple, the actual implementation of any MCMC algorithm requires careful choice of $P_\theta$ because the properties of $\hat{\mu}_t$ depend on $\theta$. In common kernels that lead to random walk Metropolis (RWM), Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) algorithms, the kernels $P_\theta$ are specified through an accept-reject mechanism in which the chain moves from time $t$ to time $t+1$ by first proposing candidate values $y$ from a density $q_\theta(y|x_t)$ and accepting them with some probability $\alpha(x_t, y)$ and setting $x_{t+1} = y$, or rejecting them and setting $x_{t+1} = x_t$. Since $\theta$ directly affects this acceptance probability, it is clear that one should choose $\theta$ so that the chain does not move too slowly or reject too many proposed values $y$, because in both these cases convergence to the stationary distribution will be slow. This has been recognised as early as in [22] and has initiated exciting research that has produced optimum average acceptance probabilities for a series of algorithms; see [30, 31, 32, 15, 6, 8, 34, 7, 35, 9]. Such optimal average acceptance probabilities provide basic guidelines for adapting single step size parameters to achieve certain average acceptance rates.

More sophisticated adaptive MCMC algorithms that can learn a full set of parameters $\theta$, such as a covariance matrix, borrow information from the history of the chain to optimise some criterion reflecting the performance of the Markov chain [14, 5, 33, 13, 2, 1, 4]. Such methods are typically
non gradient-based and the basic strategy they use is to sequentially fit the proposal $q_\theta(y|x)$ to the history of states $x_{t-1}, x_t, \ldots$, while also ignoring the rejected state values. This can result in very slow adaptation because the initial Markov chain simulations are based on a poor initial $\theta$, and the generated states, from which $\theta$ is learnt, are highly correlated and far from the target. The authors in [34] call such adaptive strategies greedy in the sense that they try to adapt too closely to initial information from the output and take considerable time to recover from misleading initial information.

In this paper, we develop faster and more robust gradient-based adaptive MCMC algorithms that make use of the gradient of the target, $\nabla \log \pi(x)$, and learn from both actual states of the chain and proposed (and possibly rejected) states. The key idea is to define and maximise w.r.t. $\theta$ an entropy regularised objective function that promotes high acceptance rates and high values for the entropy of the proposal distribution. This objective function, referred to as generalised speed measure, is inspired by the speed measure of the infinite-dimensional limiting diffusion process that captures the notion of speed at which a Markov chain converges to its stationary distribution [32]. We maximise this objective function by applying stochastic gradient variational inference techniques such as those based on the reparametrisation trick [19, 29, 40]. An advantage of our algorithm compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. In fact, the adaptation can be faster when candidate values $y$ are rejected, since then we always make full use of the gradient $\nabla \log \pi(y)$ evaluated at the rejected $y$. This allows the adaptation to start in early iterations even if the initial proposal distribution is far from optimum and the chain is not moving. We apply the method for learning multivariate RWM and MALA proposals where we adapt full covariance matrices parametrised efficiently using Cholesky factors. In the experiments we apply our algorithms to multivariate Gaussian targets and Bayesian logistic regression, and empirically show that they outperform several other baselines, including advanced HMC schemes.

2 Gradient-based adaptive MCMC

Assume a target distribution $\pi(x)$, known up to some unknown normalising constant, where $x \in \mathbb{R}^n$ is the state vector. To sample from $\pi(x)$ we consider the Metropolis-Hastings (M-H) algorithm that generates new states by sampling from a proposal distribution $q_\theta(y|x)$, which depends on parameters $\theta$, and accepts or rejects each proposed state by using the standard M-H acceptance probability
$$\alpha(x, y; \theta) = \min\left\{1, \frac{\pi(y)\, q_\theta(x|y)}{\pi(x)\, q_\theta(y|x)}\right\}. \quad (1)$$
While the M-H algorithm defines a Markov chain that converges to the target distribution, the efficiency of the algorithm heavily depends on the choice of the proposal distribution $q_\theta(y|x)$ and the setting of its parameters $\theta$.

Here, we develop a framework for stochastic gradient-based adaptation or learning of $q_\theta(y|x)$ that maximises an objective function inspired by the concept of speed measure that underlies the theoretical foundations of MCMC optimal tuning [30, 31]. Given that the chain is at state $x$ we would like: (i) to propose big jumps in the state space, and (ii) to accept these jumps with high probability.
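To make Eq. (1) concrete, the following is a minimal Python sketch (not taken from the paper) that evaluates the M-H acceptance probability in log space for a possibly asymmetric proposal. The standard Gaussian target and the MALA-style proposal with step size 0.3 are purely illustrative assumptions.

```python
import numpy as np

def mh_accept_prob(x, y, log_pi, log_q):
    """Eq. (1): alpha(x, y; theta) = min{1, pi(y) q_theta(x|y) / (pi(x) q_theta(y|x))}."""
    log_ratio = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    return float(np.exp(min(0.0, log_ratio)))

# Illustrative standard Gaussian target and a MALA-style (asymmetric) proposal; step = 0.3 is an assumption.
step = 0.3
log_pi = lambda x: -0.5 * float(x @ x)
grad_log_pi = lambda x: -x

def log_q(x, y):
    """log q_theta(y | x) for N(y | x + 0.5*step^2*grad_log_pi(x), step^2 I),
    up to an additive constant that cancels in the ratio of Eq. (1)."""
    mean = x + 0.5 * step ** 2 * grad_log_pi(x)
    return -0.5 * float((y - mean) @ (y - mean)) / step ** 2

x, y = np.zeros(2), np.array([0.4, -0.2])
print(mh_accept_prob(x, y, log_pi, log_q))
```

Working with log densities keeps the ratio in Eq. (1) numerically stable and reduces it to a simple sum of log terms.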
By assuming for now that the proposal has the standard random walk isotropic form, such that $q_\sigma(y|x) = \mathcal{N}(y|x, \sigma^2 I)$, the speed measure is defined as
$$s_\sigma(x) = \sigma^2\, \bar{\alpha}(x; \sigma), \quad (2)$$
where $\sigma^2$ denotes the variance, also called step size, of the proposal distribution, while $\bar{\alpha}(x; \sigma)$ is the average acceptance probability when starting at $x$, i.e. $\bar{\alpha}(x; \sigma) = \int \alpha(x, y; \sigma)\, q_\sigma(y|x)\, dy$. To learn a good value for the step size we could maximise the speed measure in Eq. 2, which intuitively promotes high variance for the proposal distribution together with high acceptance rates. In the theory of optimal MCMC tuning, $s_\sigma(x)$ is averaged under the stationary distribution $\pi(x)$ to obtain a global speed measure value $s_\sigma = \int \pi(x)\, s_\sigma(x)\, dx$. For simple targets and with increasing dimension this measure is maximised when $\sigma^2$ is set to a value that leads to the acceptance probability $\approx 0.234$ [30, 31]. This subsequently leads to the popular heuristic for tuning random walk proposals: tune $\sigma^2$ so that on average the proposed states are accepted with probability around $1/4$. Similar heuristics have been obtained for tuning the step sizes of more advanced schemes such as MALA and HMC, where $\approx 0.574$ is considered optimal for MALA [32] and $\approx 0.651$ for HMC [24, 9].

While the above notion of speed measure from Eq. 2 is intuitive, it is only suitable for tuning proposals having a single step size. Thus, in order to learn arbitrary proposal distributions $q_\theta(y|x)$, where $\theta$ is a vector of parameters, we need to define suitable generalisations of the speed measure. Suppose, for instance, that we wish to tune a Gaussian with a full covariance matrix, i.e. $q_\Sigma(y|x) = \mathcal{N}(y|x, \Sigma)$. To achieve this we need to generalise the objective in Eq. 2 by replacing $\sigma^2$ with some functional $F(\Sigma)$ that depends on the full covariance. An obvious choice is to consider the average squared distance $\mathbb{E}\|y - x\|^2$ given by the trace $\mathrm{tr}(\Sigma) = \sum_i \sigma_i^2$. However, this is problematic since it can lead to learning proposals with very poor mixing. To see this, note that since the trace is the sum of variances it can obtain high values even when some of the components of $x$ have very low variance, e.g. for some $x_i$ it holds that $\sigma_i^2 \approx 0$, which can result in very low sampling efficiency or even non-ergodicity. In order to define better functionals $F(\Sigma)$ we wish to exploit the intuition that for MCMC all components of $x$ need to jointly perform (relative to their scale) big jumps, a requirement that is better captured by the determinant $|\Sigma|$ or, more generally, by the entropy of the proposal distribution. Therefore, we introduce a generalisation of the speed measure having the form
$$s_\theta(x) = e^{\beta \mathcal{H}[q_\theta(y|x)]}\, \bar{\alpha}(x; \theta) = e^{\beta \mathcal{H}[q_\theta(y|x)]} \int \alpha(x, y; \theta)\, q_\theta(y|x)\, dy, \quad (3)$$
where $\mathcal{H}[q_\theta(y|x)] = -\int q_\theta(y|x) \log q_\theta(y|x)\, dy$ denotes the entropy of the proposal distribution and $\beta \geq 0$ is a hyperparameter. Note that when the proposal distribution is a full Gaussian $q_\Sigma(y|x) = \mathcal{N}(y|x, \Sigma)$, then $e^{\beta \mathcal{H}[q_\Sigma(y|x)]} = \text{const} \times |\Sigma|^{\beta/2}$, which depends on the determinant of $\Sigma$. $s_\theta(x)$, referred to as generalised speed measure, trades off between high entropy of the proposal distribution and high acceptance probability. The hyperparameter $\beta$ plays the crucial role of balancing the relative strengths of these terms. As discussed in the next section, we can efficiently optimise $\beta$ in order to achieve a desirable average acceptance rate.

In the following we make use of the above generalised speed measure to derive a variational learning algorithm for adapting the parameters $\theta$ using stochastic gradient-based optimisation.
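As an illustration of the quantity in Eq. (3), the sketch below (a toy example, not the paper's code) Monte Carlo estimates $\log s(x)$ for a full-covariance random-walk Gaussian proposal parametrised by a Cholesky factor $L$, using the fact that the Gaussian entropy contributes $\sum_i \log L_{ii}$ plus a constant. The anisotropic Gaussian target, the candidate Cholesky factors and $\beta = 1$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # Illustrative anisotropic Gaussian target with standard deviations (2, 0.2) (assumption).
    return -0.5 * (x[0] ** 2 / 4.0 + 25.0 * x[1] ** 2)

def log_generalised_speed(x, L, beta, n_samples=2000):
    """Estimate log s(x) = beta * H[q_L(.|x)] + log E_q[alpha(x, y)] from Eq. (3)."""
    n = x.size
    # Entropy of N(x, L L^T): 0.5 * n * log(2*pi*e) + sum_i log L_ii.
    entropy = 0.5 * n * np.log(2.0 * np.pi * np.e) + np.sum(np.log(np.diag(L)))
    eps = rng.standard_normal((n_samples, n))
    ys = x + eps @ L.T                                        # y = x + L eps (symmetric proposal)
    log_alpha = np.minimum(0.0, np.array([log_pi(y) for y in ys]) - log_pi(x))
    return beta * entropy + np.log(np.mean(np.exp(log_alpha)))

x = np.zeros(2)
L_iso = 0.5 * np.eye(2)                                       # isotropic proposal
L_scaled = np.diag([1.0, 0.1])                                # proposal matched to the target's scales
for L in (L_iso, L_scaled):
    print(log_generalised_speed(x, L, beta=1.0))
```

This direct Monte Carlo estimate is only for intuition about how the entropy and acceptance terms trade off at a given state; in the paper the generalised speed measure is maximised through the variational lower bound of the next section rather than estimated in this way.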
2.1 Maximising the generalised speed measure using variational inference

During the MCMC iterations we collect the pairs of vectors $\{(x_t, y_t)\}_{t \geq 0}$, where $x_t$ is the state of the chain at time $t$ and $y_t$ the corresponding proposed next state, which, if accepted, gives $x_{t+1} = y_t$. When the chain has converged, each $x_t$ follows the stationary distribution $\pi(x)$; otherwise it follows some distribution that progressively converges to $\pi(x)$. In either case we view the sequence of pairs $(x_t, y_t)$ as non-iid data based on which we wish to perform gradient-based updates of the parameters $\theta$. In practice such updates can be performed with diminishing learning rates, or, more safely, stopped completely after some number of burn-in iterations to ensure convergence. Specifically, given the current state $x_t$ we wish to take a step towards maximising $s_\theta(x_t)$ in Eq. 3, or equivalently its logarithm,
$$\log s_\theta(x_t) = \log \int \alpha(x_t, y; \theta)\, q_\theta(y|x_t)\, dy + \beta \mathcal{H}[q_\theta(y|x_t)]. \quad (4)$$
The second term is just the entropy of the proposal distribution, which typically will be analytically tractable, while the first term involves an intractable integral. To approximate the first term we work similarly to variational inference [18, 10] and we lower bound it using Jensen's inequality,
$$\log s_\theta(x_t) \geq \mathcal{F}(x_t) = \int q_\theta(y|x_t) \log \min\left\{1, \frac{\pi(y)\, q_\theta(x_t|y)}{\pi(x_t)\, q_\theta(y|x_t)}\right\} dy + \beta \mathcal{H}[q_\theta(y|x_t)] \quad (5)$$
$$= \int q_\theta(y|x_t) \min\left\{0, \log\frac{\pi(y)}{\pi(x_t)} + \log\frac{q_\theta(x_t|y)}{q_\theta(y|x_t)}\right\} dy + \beta \mathcal{H}[q_\theta(y|x_t)]. \quad (6)$$
To take a step towards maximising $\mathcal{F}$ we can apply standard stochastic variational inference techniques such as the score function method or the reparametrisation trick [11, 26, 28, 19, 29, 40, 20]. Here, we focus on the case where $q_\theta(y|x_t)$ is a reparametrisable distribution, such that $y = \mathcal{T}_\theta(x_t, \epsilon)$ where $\mathcal{T}_\theta$ is a deterministic transformation and $\epsilon \sim p(\epsilon)$. $\mathcal{F}(x_t)$ can be re-written as
$$\mathcal{F}(x_t) = \int p(\epsilon) \min\left\{0, \log\frac{\pi(\mathcal{T}_\theta(x_t, \epsilon))}{\pi(x_t)} + \log\frac{q_\theta(x_t|\mathcal{T}_\theta(x_t, \epsilon))}{q_\theta(\mathcal{T}_\theta(x_t, \epsilon)|x_t)}\right\} d\epsilon + \beta \mathcal{H}[q_\theta(y|x_t)].$$
Since MCMC at the $t$-th iteration proposes a specific state $y_t$ constructed as $\epsilon_t \sim p(\epsilon_t)$, $y_t = \mathcal{T}_\theta(x_t, \epsilon_t)$, an unbiased estimate of the exact gradient $\nabla_\theta \mathcal{F}(x_t)$ can be obtained by
$$\nabla_\theta \mathcal{F}(x_t, \epsilon_t) = \nabla_\theta \min\left\{0, \log\frac{\pi(\mathcal{T}_\theta(x_t, \epsilon_t))}{\pi(x_t)} + \log\frac{q_\theta(x_t|\mathcal{T}_\theta(x_t, \epsilon_t))}{q_\theta(\mathcal{T}_\theta(x_t, \epsilon_t)|x_t)}\right\} + \beta \nabla_\theta \mathcal{H}[q_\theta(y|x_t)],$$
which is used to make a gradient update for the parameters $\theta$. Note that the first term in the stochastic gradient is analogous to differentiating through a rectified linear hidden unit (ReLU) in neural networks, i.e. if $\log\frac{\pi(y_t)}{\pi(x_t)} + \log\frac{q_\theta(x_t|y_t)}{q_\theta(y_t|x_t)} \geq 0$ the gradient is zero (this is the case when $y_t$ is accepted with probability one), while otherwise the gradient of the first term reduces to $\nabla_\theta \left[\log\pi(\mathcal{T}_\theta(x_t, \epsilon_t)) + \log q_\theta(x_t|\mathcal{T}_\theta(x_t, \epsilon_t)) - \log q_\theta(\mathcal{T}_\theta(x_t, \epsilon_t)|x_t)\right]$.

The value of the hyperparameter $\beta$ allows us to trade off between large acceptance probability and large entropy of the proposal distribution. Such a hyperparameter cannot be optimised by maximising the variational objective $\mathcal{F}$ (this typically would set $\beta$ to a very small value so that the acceptance probability becomes close to one, but the chain is not moving since the entropy is very low). Thus, $\beta$ needs to be updated in order to control the average acceptance probability of the chain and achieve a certain desired value $\bar{\alpha}_*$.

Algorithm 1: Gradient-based Adaptive MCMC
Input: target $\pi(x)$; reparametrisable proposal $q_\theta(y|x)$ s.t. $y = \mathcal{T}_\theta(x, \epsilon)$, $\epsilon \sim p(\epsilon)$; initial $x_0$; desired average acceptance probability $\bar{\alpha}_*$.
Initialise $\theta$, $\beta = 1$.
for $t = 1, 2, 3, \ldots$ do
    Propose $\epsilon_t \sim p(\epsilon_t)$, $y_t = \mathcal{T}_\theta(x_t, \epsilon_t)$.
    Adapt $\theta$: $\theta \leftarrow \theta + \eta_t \nabla_\theta \mathcal{F}(x_t, \epsilon_t)$.
    Accept or reject $y_t$ using the standard M-H ratio to obtain $x_{t+1}$.
    Set $\alpha_t = 1$ if $y_t$ was accepted and $\alpha_t = 0$ otherwise.
    Adapt hyperparameter $\beta$: $\beta \leftarrow \beta\left[1 + \rho(\alpha_t - \bar{\alpha}_*)\right]$ (default value $\rho = 0.02$).
end for
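Below is a minimal NumPy sketch of Algorithm 1 specialised, by hand, to the simplest case of an isotropic random-walk proposal $y = x + \sigma\epsilon$ with $\theta = \log\sigma$; for this case the reparametrisation gradient of the lower bound has the closed form used in the code (the min-term contributes $\sigma\, \nabla\log\pi(y)^\top\epsilon$ only when the log ratio is negative, while the entropy term contributes $\beta n$). The Gaussian target, the initial values and the baseline learning rate $\eta = 0.05$ are illustrative assumptions, not the paper's experimental settings; $\rho = 0.02$ follows the pseudocode's default.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
log_pi = lambda x: -0.5 * np.sum(x ** 2)        # illustrative standard Gaussian target (assumption)
grad_log_pi = lambda x: -x

x = 5.0 * np.ones(n)                            # chain started far from the target
sigma, beta = 0.01, 1.0                         # deliberately poor initial step size
alpha_star, rho = 0.25, 0.02                    # target acceptance rate; rho as in Algorithm 1
eta, G = 0.05, 0.0                              # baseline learning rate, RMSProp accumulator (assumptions)

for t in range(5000):
    eps = rng.standard_normal(n)
    y = x + sigma * eps                         # propose y = T_theta(x, eps) with theta = log(sigma)
    log_ratio = log_pi(y) - log_pi(x)

    # Reparametrisation gradient of the lower bound w.r.t. theta = log(sigma):
    # the min-term is active only when log_ratio < 0, the entropy term gives beta * n.
    g_min = sigma * float(np.dot(grad_log_pi(y), eps)) if log_ratio < 0 else 0.0
    grad = g_min + beta * n

    G = 0.9 * G + 0.1 * grad ** 2               # RMSProp-style accumulator, as in the text
    sigma *= np.exp(eta / (1.0 + np.sqrt(G)) * grad)

    accepted = np.log(rng.uniform()) < log_ratio   # standard M-H accept/reject (symmetric proposal)
    if accepted:
        x = y
    beta *= 1.0 + rho * (float(accepted) - alpha_star)  # drive the acceptance rate towards alpha_star

print(sigma)  # for this target, sigma should settle roughly near the classical 2.38 / sqrt(n) scaling
```

Because the adaptation signal comes from the proposed (and possibly rejected) states, $\sigma$ grows out of its poor initial value even while the chain is barely moving, which is exactly the early-adaptation behaviour the paper highlights.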
The value of $\bar{\alpha}_*$ can be determined based on the specific MCMC proposal we are using and by following standard recommendations in the literature, as discussed also in the previous section. For instance, when we use RWM, $\bar{\alpha}_*$ can be set to $1/4$ (see Section 2.2), while for gradient-based MALA (see Section 2.3), $\bar{\alpha}_*$ can be set to $0.55$. Pseudocode for the general procedure is given by Algorithm 1. We set the learning rate $\eta_t$ using RMSProp [39]; at each iteration $t$ we set $\eta_t = \eta / (1 + \sqrt{G_t})$, where $\eta$ is the baseline learning rate, and the updates of $G_t$ depend on the gradient estimate $\nabla_\theta \mathcal{F}(x_t, \epsilon_t)$ as $G_t = 0.9\, G_{t-1} + 0.1\, [\nabla_\theta \mathcal{F}(x_t, \epsilon_t)]^2$.

2.2 Fitting a full covariance Gaussian random walk proposal

We now specialise to the case where the proposal distribution is a random walk Gaussian $q_L(y|x) = \mathcal{N}(y|x, L L^\top)$, where the parameter $L$ is a positive definite lower triangular matrix, i.e. a Cholesky factor. This distribution is reparametrisable since $y = \mathcal{T}_L(x, \epsilon) = x + L\epsilon$, $\epsilon \sim \mathcal{N}(\epsilon|0, I)$. At the $t$-th iteration when the state is $x_t$ the lower bound becomes
$$\mathcal{F}_L(x_t) = \int \mathcal{N}(\epsilon|0, I) \min\{0, \log\pi(x_t + L\epsilon) - \log\pi(x_t)\}\, d\epsilon + \beta \sum_{i=1}^{n} \log L_{ii} + \text{const}. \quad (7)$$
Here, the proposal distribution has cancelled out from the M-H ratio, since it is symmetric, while $\mathcal{H}[q_L(y|x_t)] = \log|L| + \text{const}$ and $\log|L| = \sum_{i=1}^{n} \log L_{ii}$. By making use of the MCMC proposed state $y_t = x_t + L\epsilon_t$ we can obtain an unbiased estimate of the exact gradient $\nabla_L \mathcal{F}_L(x_t)$,
$$\nabla_L \mathcal{F}_L(x_t, \epsilon_t) = \begin{cases} \left[\nabla_y \log\pi(y_t)\, \epsilon_t^\top\right]_{\text{lower}} + \beta\, \mathrm{diag}\!\left(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}}\right), & \text{if } \log\pi(y_t) - \log\pi(x_t) < 0, \\[4pt] \beta\, \mathrm{diag}\!\left(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}}\right), & \text{otherwise}, \end{cases} \quad (8)$$
where $[\cdot]_{\text{lower}}$ keeps only the lower triangular part.

2.3 Fitting a full covariance MALA proposal

We can also fit a gradient-based MALA proposal of the form $q_L(y|x) = \mathcal{N}\!\left(y \,\middle|\, x + \tfrac{1}{2} L L^\top \nabla\log\pi(x),\, L L^\top\right)$, where the covariance matrix is parametrised by the Cholesky factor $L$. Again this distribution is reparametrisable according to $y = \mathcal{T}_L(x, \epsilon) = x + \tfrac{1}{2} L L^\top \nabla\log\pi(x) + L\epsilon$, $\epsilon \sim \mathcal{N}(\epsilon|0, I)$. At the $t$-th iteration when the state is $x_t$ the reparametrised lower bound simplifies significantly and reduces to
$$\mathcal{F}_L(x_t) = \int \mathcal{N}(\epsilon|0, I) \min\Big\{0,\; \log\pi\big(x_t + \tfrac{1}{2} L L^\top \nabla\log\pi(x_t) + L\epsilon\big) - \log\pi(x_t) - \tfrac{1}{2}\Big(\big\|\tfrac{1}{2} L^\top\big(\nabla\log\pi(x_t) + \nabla\log\pi(y)\big) + \epsilon\big\|^2 - \|\epsilon\|^2\Big)\Big\}\, d\epsilon + \beta \sum_{i=1}^{n} \log L_{ii} + \text{const},$$
where $\|\cdot\|$ denotes the Euclidean norm and, in the term $\nabla\log\pi(y)$, $y = x_t + \tfrac{1}{2} L L^\top \nabla\log\pi(x_t) + L\epsilon$. Then, based on the proposed state $y_t = \mathcal{T}_L(x_t, \epsilon_t)$, we can obtain the unbiased gradient estimate $\nabla_L \mathcal{F}_L(x_t, \epsilon_t)$ similarly to the previous section. In general, such an estimate can be very expensive because the existence of $L$ inside $\nabla\log\pi(y_t)$ means that we need to compute the matrix of second derivatives, or Hessian, $\nabla\nabla\log\pi(y_t)$. We have found that an alternative procedure which stops the gradient inside $\nabla\log\pi(y_t)$ (i.e. it views $\nabla\log\pi(y_t)$ as a constant w.r.t. $L$) has small bias and works well in practice. In fact, as we will show in the experiments, this approximation not only is computationally much faster but, remarkably, also leads to better adaptation compared to the exact Hessian procedure, presumably because not accounting for the gradient inside $\nabla\log\pi(y_t)$ reduces the variance. Furthermore, the expression of the gradient w.r.t. $L$ used by this fast approximation can be computed very efficiently with a single $O(n^2)$ operation (an outer vector product; see Supplement), while each iteration of the algorithm requires overall at most four $O(n^2)$ operations.
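The following sketch implements the Cholesky-factor gradient estimate of Eq. (7), as reconstructed in Eq. (8) above, i.e. a single $O(n^2)$ outer product restricted to the lower triangle plus the $\beta\, \mathrm{diag}(1/L_{ii})$ entropy term, and takes one illustrative adaptation step. The correlated Gaussian target, the step size 0.01 and $\beta = 1$ are assumptions made for the example.

```python
import numpy as np

def grad_L_estimate(x, y, eps, L, beta, log_pi, grad_log_pi):
    """Stochastic gradient of F_L(x) in Eq. (7) w.r.t. the Cholesky factor L,
    for the random-walk proposal y = x + L eps (one O(n^2) outer product)."""
    grad = beta * np.diag(1.0 / np.diag(L))            # entropy term: beta * diag(1/L_ii)
    if log_pi(y) - log_pi(x) < 0:                      # min-term is active only when the M-H ratio is < 1
        grad += np.tril(np.outer(grad_log_pi(y), eps)) # [grad log pi(y) eps^T] restricted to lower triangle
    return grad

# One adaptation step on an illustrative correlated Gaussian target (assumptions).
rng = np.random.default_rng(3)
Sigma_true = np.array([[1.0, 0.9], [0.9, 1.0]])
P = np.linalg.inv(Sigma_true)
log_pi = lambda x: -0.5 * x @ P @ x
grad_log_pi = lambda x: -P @ x

L = 0.1 * np.eye(2)                                    # initial Cholesky factor
x = np.zeros(2)
eps = rng.standard_normal(2)
y = x + L @ eps
L += 0.01 * grad_L_estimate(x, y, eps, L, beta=1.0,
                            log_pi=log_pi, grad_log_pi=grad_log_pi)
print(L)
```

In a full run this update would be plugged into Algorithm 1 together with the RMSProp learning rate and the $\beta$ adaptation, and the diagonal of $L$ must be kept positive (one common option, not necessarily the one used in the paper, is to parametrise the diagonal on the log scale).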
