




EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning

Kunal Menda¹, Katherine Driggs-Campbell², and Mykel J. Kochenderfer¹

¹Kunal Menda and Mykel J. Kochenderfer are at Stanford University, Stanford, CA 94305, USA. {kmenda, mykel}@stanford.edu
²Katherine Driggs-Campbell is at the University of Illinois at Urbana-Champaign, IL 61820, USA. krdc@illinois.edu

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4–8, 2019.

Abstract — Although imitation learning is often used in robotics, the approach frequently suffers from data mismatch and compounding errors. DAgger is an iterative algorithm that addresses these issues by aggregating training data from both the expert and novice policies, but it does not consider the impact of safety. We present a probabilistic extension to DAgger, which attempts to quantify the confidence of the novice policy as a proxy for safety. Our method, EnsembleDAgger, approximates a Gaussian process using an ensemble of neural networks. Using the variance as a measure of confidence, we compute a decision rule that captures how much we doubt the novice, thus determining when it is safe to allow the novice to act. With this approach, we aim to maximize the novice's share of actions while constraining the probability of failure. We demonstrate improved safety and learning performance compared to other DAgger variants and classic imitation learning on an inverted pendulum and in the MuJoCo HalfCheetah environment.

I. INTRODUCTION

To be truly intelligent, robotic systems must have the ability to learn by exploring their environment and state space in a safe way [1]. One method to guide exploration is to learn from expert demonstrations [2], [3], [4]. In contrast to reinforcement learning, where an explicit reward function must be defined, imitation learning guides exploration through expert supervision, allowing a robot to effectively learn from direct experience [5]. However, such supervised approaches are often suboptimal or fail when the policy that is being trained, referred to as the novice policy, encounters situations that are not adequately represented in the dataset provided by the expert [6], [7]. While failures may be insignificant in simulation, safe learning is important when acting in the real world [1].

There are several methods for guided policy search in imitation learning settings [8]. One example is DAgger, which improves the training dataset by aggregating new data from both the expert and novice policies [7]. DAgger has many desirable properties, including online functionality and theoretical guarantees. This approach, however, does not guarantee safety. Recent work extended DAgger to address some inherent drawbacks [9], [10]. In particular, SafeDAgger augments DAgger with a decision-rule policy to provide safe exploration while minimizing queries to the expert [11]. The shared goal of these methods is to efficiently train the novice to control the system while minimizing expert intervention.

These algorithms assume that by allowing the novice to act, the system will likely deviate from the expert trajectory set and sample a new state. There is a chance, however, that the state visited is unsafe or is a failure state. If the expert acts instead, we assume that the system will move along a safe trajectory, which likely passes through states similar to those previously observed. The goal of this paper is to present an algorithm that maximizes the novice's share of actions while constraining the probability of failure. Ideally, the proximity to a failure state (measured as an ℓ2 distance or as the likelihood of encountering the state under some operating condition) is known, and a safety envelope can be computed to guarantee safety [12].
In the case of model-free learning, such guarantees are much more difficult to make. If we consider the novice action to be a perturbed form of the expert action, then we hypothesize that, for many systems, the magnitude of permissible perturbation to expert actions is related to the distance from unsafe regions. Further, in a model-free case where expert demonstrations are available, we hypothesize that there is an inverse relationship between a state's similarity to those in expert trajectories and the allowed perturbations. We visualize this intuition in Figure 1. In the left panel, we see that the maximum permissible deviation from an expert action should be low as the system approaches a wall, which is considered a dangerous state. In such settings, experts will likely prefer trajectories that maintain some margin of distance from unsafe states. Assuming this to be the case, it follows that in unfamiliar states the system is likely at higher risk of entering failure states, and thus it is safer to allow the expert to act, while in familiar regions it is permissible for the novice to act with large deviation from the expert action.

This paper extends DAgger to a probabilistic domain and aims to minimize expert intervention while constraining the likelihood of failure. While SafeDAgger uses the discrepancy between the expert and the novice to determine safety, we measure doubt by quantifying the uncertainty, or confidence, of the novice policy. To quantify doubt, we use an ensemble of neural networks to estimate the variance of the novice action in a particular state, which we show can effectively approximate Gaussian processes (GPs), even in complex, high-dimensional spaces [13]. We demonstrate how our method outperforms existing DAgger variants in an imitation learning setting.

This paper makes two key contributions: (1) we present EnsembleDAgger, a Bayesian extension to DAgger, which introduces a probabilistic notion of safety to minimize expert intervention while constraining the probability of failure; and (2) we demonstrate the utility of this approach with improved performance and safety in an imitation learning case study on an inverted pendulum, and demonstrate the scalability of the approach on the MuJoCo HalfCheetah domain.

Fig. 1: Visualization of the tradeoffs between familiarity and risk. (left) Example scenarios where perturbations are or are not permissible due to low or high risk. Red trajectories illustrate expert corrections and the blue trajectory illustrates novice actions. (right) Plots visualizing the ideal tradeoff between distance to a failure state and allowed deviations, and the approximation of this tradeoff using similarity to expert demonstrations. (Axes: "Max Allowed Perturbation" vs. "Distance to a Failure State" / "Similarity to Demonstration"; annotations: "Small acceptable perturbation", "Large acceptable perturbation".)

II. BACKGROUND

This section presents a brief technical overview of DAgger, SafeDAgger, and different methods for approximating GPs using neural networks.

A. DAgger and SafeDAgger

The DAgger framework extends traditional supervised learning approaches by simultaneously running both an expert policy that we wish to clone and a novice policy we wish to train [14]. By aggregating new data from the expert, the underlying model and reward structure are uncovered. Using supervised learning, we train an initial novice policy π_nov,0 on some initial training set D_0 generated by the expert policy π_exp.
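As a concrete illustration of this initialization step (not from the paper), the following minimal Python sketch fits an initial novice policy to expert-labeled observations by least-squares regression. The linear policy class, the stand-in expert_policy, and the data sizes are assumptions chosen only so that the snippet runs.

```python
import numpy as np

def expert_policy(obs):
    """Stand-in for the expert pi_exp: a fixed linear feedback law (assumed)."""
    return obs @ np.array([-2.0, -1.0])

# D_0: observations visited by the expert, labeled with the expert's actions.
rng = np.random.default_rng(0)
observations = rng.normal(size=(500, 2))                       # o_t
actions = np.array([expert_policy(o) for o in observations])   # pi_exp(o_t)

# Supervised learning step: fit an initial novice policy pi_nov,0 (here linear)
# by least-squares regression from observations to expert actions.
weights, *_ = np.linalg.lstsq(observations, actions, rcond=None)

def novice_policy(obs):
    return obs @ weights

print("mean squared imitation error on D_0:",
      float(np.mean((observations @ weights - actions) ** 2)))
```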
With this initialization, DAgger iteratively collects additional training examples from a mixture of the expert and novice policies. During a given episode, the combined expert and novice system interacts with the environment under the supervision of a decision rule. The decision rule, referred to as DR in Algorithm 1, decides at every time step t whether to use the action from the novice or the expert to interact with the environment (Figure 2). The observations o_t received during each epoch and the expert's choice of corresponding actions make up a new dataset called D_i. This new dataset of training examples is combined with the previous sets, D ← D ∪ D_i, and the novice policy is then re-trained on D, as presented in Algorithm 1.

Fig. 2: Flowchart for DAgger variants, where the decision rule differs between approaches. (At each step the decision rule selects between the expert action a_exp,t = π_exp(o_t) and the novice action a_nov,t = π_nov(o_t); the chosen action a_t is applied to the environment, which returns the next observation o_t.)

Algorithm 1 DAGGER
1: procedure DAGGER(DR)
2:   Initialize D ← ∅
3:   Initialize π_nov,1
4:   for epoch i = 1 : K do
5:     Sample T-step trajectories using DR
6:     Get D_i = {(o_t, π_exp(o_t))}, t = 1 : T
7:     Aggregate datasets: D ← D ∪ D_i
8:     Train π_nov,i+1 on D

By allowing the novice to act, the combined system explores parts of the state space further from the nominal trajectories of the expert. By querying the expert in these parts of the state space, the novice is able to learn a more robust policy. However, allowing the novice to always act risks the possibility of encountering an unsafe state, which can be costly in real-world experiments. The VanillaDAgger algorithm and SafeDAgger balance this trade-off through their choice of decision rules.

Under VanillaDAgger (Algorithm 2), the expert's action is chosen with probability β_i ∈ [0, 1], where i denotes the DAgger epoch. If β_i = β_0^i for some β_0 ∈ (0, 1), then the novice takes increasingly more actions each epoch. As the novice is given more training labels from previous epochs, it is allowed greater autonomy in exploring the state space. The VanillaDAgger decision rule does not consider any similarity measure between the novice and expert actions; hence, even if the novice suggests a highly unsafe action, VanillaDAgger allows the novice to act with probability 1 − β_i.

Algorithm 2 VANILLADAGGER Decision Rule
1: procedure DR(o_t, i, β_0)
2:   a_nov,t ← π_nov(o_t)
3:   a_exp,t ← π_exp(o_t)
4:   β_i ← β_0^i
5:   z ∼ Uniform(0, 1)
6:   if z ≤ β_i then
7:     return a_exp,t
8:   else
9:     return a_nov,t

The optimal decision rule approximated by SafeDAgger, presented in Algorithm 3 and referred to here as the SafeDAgger decision rule, computes the discrepancy between the expert and novice actions and allows the novice to act if the distance between the actions is less than some chosen threshold τ [11].¹

¹To reduce the number of expert queries, SafeDAgger approximates this decision rule using a deep policy that determines whether or not the novice policy is likely to deviate from the reference policy. Unlike SafeDAgger, we are not concerned with minimizing expert queries during a given episode. Hence, we compare to the decision rule directly, as opposed to the approximation.

Algorithm 3 SAFEDAGGER Decision Rule
1: procedure DR(o_t)
2:   a_nov,t ← π_nov(o_t)
3:   a_exp,t ← π_exp(o_t)
4:   if ‖a_nov,t − a_exp,t‖₂ ≤ τ then
5:     return a_nov,t
6:   else
7:     return a_exp,t
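To make the structure of Algorithms 1–3 concrete, the following is a minimal, self-contained Python sketch of the DAgger loop with VanillaDAgger- and SafeDAgger-style decision rules as interchangeable functions. It is our illustration, not the authors' code: the toy dynamics, the linear stand-in expert, the nearest-neighbour novice, and all constants (beta0, tau, horizon) are assumptions chosen only so the example runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(state, action):
    """Toy double-integrator dynamics standing in for the real environment."""
    pos, vel = state
    return np.array([pos + 0.1 * vel, vel + 0.1 * action])

def expert(obs):
    """Stand-in expert pi_exp: a stabilizing linear feedback law."""
    return float(-1.5 * obs[0] - 1.0 * obs[1])

class Novice:
    """Crude novice pi_nov: 1-nearest-neighbour regression on the aggregated data."""
    def __init__(self):
        self.obs, self.act = np.zeros((0, 2)), np.zeros(0)
    def train(self, dataset):
        obs, act = zip(*dataset)
        self.obs, self.act = np.array(obs), np.array(act)
    def __call__(self, obs):
        if len(self.act) == 0:
            return 0.0
        i = np.argmin(np.linalg.norm(self.obs - obs, axis=1))
        return float(self.act[i])

def vanilla_dagger_rule(obs, novice, expert_fn, epoch, beta0=0.5):
    """Algorithm 2: pick the expert with probability beta_i = beta0 ** epoch."""
    beta_i = beta0 ** epoch
    return expert_fn(obs) if rng.uniform() <= beta_i else novice(obs)

def safedagger_rule(obs, novice, expert_fn, epoch, tau=0.2):
    """Algorithm 3: let the novice act when its action is close to the expert's."""
    a_nov, a_exp = novice(obs), expert_fn(obs)
    return a_nov if abs(a_nov - a_exp) <= tau else a_exp

def dagger(decision_rule, epochs=5, horizon=50):
    """Algorithm 1: sample trajectories under the decision rule, label the
    visited observations with the expert, aggregate, and retrain each epoch."""
    D, novice = [], Novice()
    for i in range(1, epochs + 1):
        state = rng.normal(scale=0.5, size=2)            # random initial condition
        D_i = []
        for _ in range(horizon):
            action = decision_rule(state, novice, expert, i)
            D_i.append((state.copy(), expert(state)))    # expert labels o_t
            state = env_step(state, action)
        D.extend(D_i)                                    # D <- D  U  D_i
        novice.train(D)                                  # retrain pi_nov on D
    return novice

novice = dagger(safedagger_rule)
print("novice action at the origin:", novice(np.zeros(2)))
```

Swapping `safedagger_rule` for `vanilla_dagger_rule` changes only the decision rule, mirroring the shared structure shown in Figure 2.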
Though this decision rule is claimed to be optimal, we argue that it has a shortcoming. An ideal decision rule would allow the novice to act if there is a sufficiently low probability that the system can transition to an unsafe state. If the combined system is currently near an unsafe state, the tolerable perturbation from the expert's choice of action is smaller than when the system is far from unsafe states. Hence, in practice, the single threshold τ employed in SafeDAgger is either too conservative when the system is far from unsafe states, or too relaxed when near them. To approximate the ideal decision rule in a model-free manner, we propose considering not just the distance between the novice's and expert's actions, but also the uncertainty in the novice policy at a given state. To estimate the uncertainty of the novice policy, we use Bayesian deep learning.

Two works build on the algorithm presented here. Kelly et al. [15] perform experiments on an autonomous vehicle and find a safe method for querying humans for demonstrations and for calibrating the threshold parameters of the algorithm presented in our work. Cronrath et al. [16] propose an extension of our ideas that attempts to combine the improved safety of a Bayesian extension to DAgger with the query efficiency of SafeDAgger.

B. Bayesian Approximation Methods

Recent research has focused on approximating GPs with neural networks [17]. While GPs alone have shown great success in modeling uncertainty and approximating safety [18], traditional GP approaches are computationally expensive for high-dimensional feature spaces and large datasets [13]. Advances in deep learning have shown great success in handling these complexities. Two methods for approximating GPs with deep neural networks are ensemble methods [19] and Monte Carlo dropout [20]. Refer to Appendix A of [21] for a summary of the advantages and disadvantages of these approaches and an empirical evaluation of these methods.

In this work, we chose to use the ensemble method, a technique for training a collection of neural networks to execute the same task and then combining their outputs into a single prediction. This approach has been shown to significantly improve performance in practice [22]. Prior work employed an ensemble of neural networks to approximate GPs and demonstrated that this is a more straightforward approach for estimating predictive uncertainty [19]. Typically, neural networks predict point estimates of the output that are optimized to minimize the mean squared error on the training set. The authors claim that this approach does not capture irreducible (or aleatoric) uncertainty, but only epistemic uncertainty. They propose using a proper scoring rule, such as the negative log-likelihood, as a loss function to train an ensemble in which each network predicts a mean and a variance of a Gaussian distribution over the output. They postulate that such loss functions provide a better measure of the quality of predictive uncertainty and thus reward better-calibrated predictions. Network predictions are then combined as a mixture of Gaussians.
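Below is a small PyTorch sketch of this idea (our illustration under stated assumptions, not the authors' implementation): each ensemble member predicts a mean and a variance, is trained with the Gaussian negative log-likelihood as a proper scoring rule, and the members' predictions are combined as a mixture of Gaussians whose variance serves as the novice's doubt. The network sizes, the toy 1-D regression target standing in for expert actions, and the training settings are all assumed for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class MeanVarianceNet(nn.Module):
    """One ensemble member: predicts a Gaussian (mean, variance) over the action."""
    def __init__(self, obs_dim=1, act_dim=1, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)
        self.logvar_head = nn.Linear(hidden, act_dim)
    def forward(self, obs):
        h = self.body(obs)
        return self.mean_head(h), self.logvar_head(h).exp()   # mean, variance

def gaussian_nll(mean, var, target):
    """Proper scoring rule used to train each member (negative log-likelihood)."""
    return (0.5 * torch.log(var) + 0.5 * (target - mean) ** 2 / var).mean()

# Toy "expert" data: a 1-D observation mapped to a 1-D action.
obs = torch.linspace(-1.0, 1.0, 200).unsqueeze(1)
act = torch.sin(3.0 * obs) + 0.05 * torch.randn_like(obs)

ensemble = [MeanVarianceNet() for _ in range(5)]
for net in ensemble:
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(300):
        mean, var = net(obs)
        loss = gaussian_nll(mean, var, act)
        opt.zero_grad(); loss.backward(); opt.step()

def novice_with_doubt(o):
    """Combine member predictions as a mixture of Gaussians:
    mixture mean (novice action) and mixture variance (the novice's doubt)."""
    with torch.no_grad():
        means, vars_ = zip(*(net(o) for net in ensemble))
    means, vars_ = torch.stack(means), torch.stack(vars_)
    mix_mean = means.mean(dim=0)
    mix_var = (vars_ + means ** 2).mean(dim=0) - mix_mean ** 2
    return mix_mean, mix_var

in_dist = torch.tensor([[0.3]])     # similar to the training data
out_dist = torch.tensor([[3.0]])    # far from the training data
for o in (in_dist, out_dist):
    m, v = novice_with_doubt(o)
    print(f"obs={o.item():+.1f}  mean action={m.item():+.3f}  doubt={v.item():.4f}")
```

Evaluating at a point far from the training data typically yields a larger mixture variance, which is the behaviour the doubt-based decision rule described next relies on.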
III. ENSEMBLEDAGGER

We present the EnsembleDAgger decision rule, in which the discrepancy between the expert's action and the novice's mean action, as well as the novice's doubt (the variance of the novice's action), are used to decide whether to choose the novice action. According to the EnsembleDAgger decision rule, the novice must satisfy two conditions in order to act. The first is that the discrepancy between the novice's and expert's actions, i.e. ‖ā_nov,t − a_exp,t‖₂, must be less than some threshold τ. This is the SafeDAgger decision rule, but it will henceforth be referred to as the discrepancy rule. Assuming the novice policy outputs a variance on its predicted action, σ²(a_nov,t), as an ensemble of neural networks would, the second condition is that σ²(a_nov,t) is less than some threshold χ. We refer to this condition as the doubt rule. As shown in Figure 3, in order for the novice to act according to the EnsembleDAgger decision rule, it must satisfy both the discrepancy rule and the doubt rule. The algorithm, described in Algorithm 4, is parameterized by the values τ and χ.

Fig. 3: The EnsembleDAgger decision rule is parametrized by doubt and discrepancy bounds and is a low-order, model-free approximation to the ideal decision rule (shown in green). (Axes: novice doubt vs. novice discrepancy; the region in which the novice acts is bounded by the two thresholds.)

Algorithm 4 ENSEMBLEDAGGER Decision Rule
1: procedure DR(o_t)
2:   ā_nov,t, σ²(a_nov,t) ← π_nov(o_t)
3:   a_exp,t ← π_exp(o_t)
4:   ε ← ‖ā_nov,t − a_exp,t‖₂
5:   δ ← σ²(a_nov,t)
6:   if ε ≤ τ and δ ≤ χ then
7:     return ā_nov,t
8:   else
9:     return a_exp,t

We restate the assumptions made to explain why this decision rule is able to better guarantee the system's safety:
1) The expert prefers trajectories that avoid failure states and rarely visits near-failure states, implying that states dissimilar to those in expert trajectories (or states unfamiliar to the novice) are likely to be in closer proximity to failure states.
2) Following from (1), and by capturing epistemic uncertainty, or lack of familiarity with states in the training dataset, the novice's doubt provides a model-free proxy for proximity to failure states.
3) In order to constrain the probability of encountering a failure state, the discrepancy between the action taken and the expert's action must be less than some bound.
4) The ideal bound should be state-dependent, such that the bound is tighter in close proximity to failure states.
5) Following from (2)–(4), the bound on discrepancy should decrease as the novice's doubt increases.

It is also assumed that the expert policy is primarily unimodal, as is commonly assumed in most imitation learning settings. Further, using a neural-network-based dissimilarity measure is useful for imitation learning, as neural networks scale more gracefully to high-dimensional input spaces and large datasets than most non-parametric measures.

Given that we have a measure of doubt via the variance on novice actions, we would ideally like to specify the bound on discrepancy as a monotonically decreasing function of doubt. To this end, we have experimented with the idea of making the discrepancy bound proportional to the inverse of doubt. However, the parameters specifying an arbitrary function mapping doubt to a discrepancy bound must be considered hyperparameters of the algorithm and tuned by the practitioner. We opt for the low-order approximation to the ideal functional mapping shown in Figure 3, because the two hyperparameters τ and χ are easy to interpret. By appropriately choosing the hyperparameters τ and χ, we satisfy the dual objectives of allowing the novice to act only if it is sufficiently confident in its action and sufficiently close to the expert. As χ → ∞, the decision rule converges to that of SafeDAgger. As τ → ∞, the decision rule ignores discrepancy and allows the novice to act if it is confident, without comparison to the expert action. However, since the novice is only confident in states similar to those in D, it is likely that the novice having low doubt causes its action to also have low discrepancy, implying that the algorithm is less sensitive to an arbitrary increase in τ than to an arbitrary increase in χ. This statement is qualified in the next section by showing that using the doubt rule alone (by setting τ → ∞) leads to better performance than using the discrepancy rule alone (by setting χ → ∞).
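The decision rule itself is only a few lines. The sketch below is our rendering of Algorithm 4, with hypothetical stand-in policies and arbitrary thresholds τ = 0.2 and χ = 0.05; it lets the novice act only when both the discrepancy rule and the doubt rule are satisfied.

```python
import numpy as np

def ensemble_dagger_rule(obs, novice_with_doubt, expert, tau=0.2, chi=0.05):
    """EnsembleDAgger decision rule (Algorithm 4): the novice acts only if its
    mean action is close to the expert's action (discrepancy rule) AND its
    predictive variance is small (doubt rule)."""
    mean_a_nov, var_a_nov = novice_with_doubt(obs)
    a_exp = expert(obs)
    discrepancy = np.linalg.norm(mean_a_nov - a_exp)    # ||a_nov - a_exp||_2
    doubt = float(np.max(var_a_nov))                     # worst-case variance
    if discrepancy <= tau and doubt <= chi:
        return mean_a_nov, "novice"
    return a_exp, "expert"

# Placeholder policies, purely for illustration.
def toy_expert(obs):
    return -1.0 * obs

def toy_novice_with_doubt(obs):
    # Pretend doubt grows with distance from the (assumed) training region.
    return -0.9 * obs, 0.01 + 0.1 * np.abs(obs)

for o in (np.array([0.1]), np.array([2.0])):
    action, who = ensemble_dagger_rule(o, toy_novice_with_doubt, toy_expert)
    print(f"obs={o[0]:+.1f} -> {who} acts with action {action}")
```

Setting chi very large recovers the SafeDAgger-style discrepancy rule alone, while setting tau very large leaves only the doubt rule, matching the limiting behaviour discussed above.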
Fig. 4: The inverted pendulum environment has a state space of (θ, θ̇) and an action space of the torque u.

Fig. 5: States in the expert's basin of attraction, i.e. states from which the expert converges to the origin. The figure also shows the set from which initial conditions of DAgger epochs are uniformly drawn in this experiment. (Axes: θ (rad) vs. θ̇ (rad/s); legend: "Expert converges", "Initial conditions".)

Though not the focus of this work, it is also worth noting that the expert only needs to be queried if the do…