




arXiv [cs.LG] 19 Jan 2023

A [...]

Jacob Beck*, jacob.beck@cs.ox.ac.uk, University of Oxford
Risto Vuorio*, risto.vuorio@cs.ox.ac.uk, University of Oxford
Evan Zheran Liu, evanliu@[...], Stanford University
Zheng Xiong, zheng.xiong@cs.ox.ac.uk, University of Oxford
Luisa Zintgraf†, zintgraf@[...], University of Oxford
Chelsea Finn, cbfinn@[...]
[...] shimon.whiteson@cs.ox.ac.uk, University of Oxford

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile successes [...] held back from more widespread adoption by [...]

1 Introduction

Meta-reinforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL [...] as a machine learning problem for a significant period of time. Intriguingly, [...]

2 Background

2.1 Reinforcement learning

[...] to as the agent's environment. An MDP is defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, P_0, R, \gamma, T \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P(s_{t+1} \mid s_t, a_t): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ the probability of transitioning from state $s_t$ to state $s_{t+1}$ after taking action $a_t$, $P_0(s_0): \mathcal{S} \to \mathbb{R}_+$ is a distribution [...]

A policy is a function $\pi(a \mid s): \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ that maps states to action probabilities. This way, [...]

$$P(\tau) = P_0(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t). \quad (1)$$

[...]

$$J(\pi) = \mathbb{E}_{\tau \sim P(\tau)} \left[ \sum_{t=0}^{T} \gamma^t r_t \right], \quad (2)$$

[...] multiple episodes are gathered. If $H$ episodes have been gathered, then $\mathcal{D} = \{\tau^h\}_{h=0}^{H}$ is all of the data [...] define an RL algorithm as the function $f(\mathcal{D}): ((\mathcal{S} \times \mathcal{A} \times \mathbb{R})^T)^H \to \Phi$. In practice, the data may include [...]

2.2 Meta-RL definition

[...] is instead to learn (parts of) an algorithm $f$ using machine learning. Where RL learns a policy, [...] the human from directly designing and implementing the RL algorithms [...] to maximize a meta-RL objective. Hence, $f_\theta$ outputs the parameters of $\pi_\phi$ directly: $\phi = f_\theta(\mathcal{D})$. We refer to the policy $\pi_\phi$ as the base policy with base parameters $\phi$. Here, $\mathcal{D}$ is a meta-trajectory [...] accordingly, we may call [...] the outer-loop parameters and [...] supported by any set of tasks. However, [...] $\mathcal{S}$ and $\mathcal{A}$ to be shared between all of the tasks and the tasks to only [...]

$$\mathcal{J}(\theta) = \mathbb{E}_{\mathcal{M}^i \sim p(\mathcal{M})} \left[ \mathbb{E}_{\mathcal{D}} \left[ \sum_{\tau \in \mathcal{D}_{K:H}} G(\tau) \,\middle|\, f_\theta, \mathcal{M}^i \right] \right], \quad (3)$$

where $G(\tau)$ is the discounted return in the MDP $\mathcal{M}^i$ and $H$ is the length of the trial, or the task-[...] inner-loop $f_\theta(\mathcal{D})$.

2.3 Example algorithms

[...] Meta-Learning (MAML), which uses meta-gradients, and Fast RL via Slow RL (RL$^2$), which uses recurrent neural networks [46, 239]. Many meta-RL algorithms build on [...] similar to those used in MAML and RL$^2$, which makes them excellent [...]

MAML. Many designs of the inner-loop algorithm $f_\theta$ build on existing RL algorithms and use meta-learning to improve them. MAML [55] is an influential design following this pattern. Its [...] with gradient descent to be a good starting point for learning on tasks from the task distribution. When adapting to a new task, MAML collects data [...] step for a task $\mathcal{M}^i \sim p(\mathcal{M})$:

$$\phi_1^i = f(\mathcal{D}_0^i, \phi_0) = \phi_0 + \alpha \nabla_{\phi_0} \hat{J}(\mathcal{D}_0^i, \pi_{\phi_0}),$$

where $\hat{J}(\mathcal{D}_0^i, \pi_{\phi_0})$ is an estimate of the returns of the policy $\pi_{\phi_0}$ for the task $\mathcal{M}^i$ and $\alpha$ is the [...]

$$\phi_0' = \phi_0 + \beta \nabla_{\phi_0} \sum_{\mathcal{M}^i \sim p(\mathcal{M})} \hat{J}(\mathcal{D}_1^i, \pi_1^i),$$

where $\pi_1^i$ is the policy for task $i$ updated once by the inner-loop, $\beta$ is a learning rate, and [...] policy for variance reduction [...] higher values of $K$ [...] with $K$ [...] up to differences in the discounting. To optimize [...] the RNN. However, MAML cannot trivially [...]

[Table: example methods in the multi-task setting: RL$^2$ [46, 239], MAML [55], LPG, MetaGenRL [...].]

2.4 Problem Categories

While the given problem setting applies to all of meta-RL, distinct clusters in the literature have [...] multi-task setting. In this setting, an agent must quickly [...] training. Methods for this many-shot single-task setting tend to [...]

[Figure: meta-RL problem categories.
Few-shot meta-RL: meta-learning over multiple (similar) tasks; adaptation goal: learn new tasks within a few steps/episodes. Zero-shot (perform well from start) methods: RL$^2$, L2RL, VariBAD. Few-shot (free exploration phase) methods: MAML, DREAM.
Many-shot meta-RL, multi-task: meta-learning over multiple (diverse) tasks; adaptation goal: learn new tasks better than standard RL algorithms; methods: LPG, MetaGenRL.
Many-shot meta-RL, single-task: meta-learning over windows in a single task (no reset); adaptation goal: accelerate standard RL algorithms; methods: STACX, FRODO.]

[Table: few-shot meta-RL methods by category.]

Parameterized policy gradients
  MAML-like: Finn et al. [55], Li et al. [124], Sung et al. [219], Vuorio et al. [235], Zintgraf [...]
  Distributional MAML: [...]
  Meta-gradient estimation: Foerster et al. [60], Al-Shedivat et al. [207], Stadie et al. [216], Liu et al. [133], Mao et al. [139], Fallah et al. [52], Tang [222], and Vuorio et al. [234]
Black box
  [...] inner loop: Heess et al. [88], Duan et al. [46], Wang et al. [239], Humplik et al. [95], Fakoor et al. [51], Yan et al. [256], Zintgraf et al. [281], Liu et al. [130], and Zintgraf et al. [282]
  Attention: Mishra et al. [150], Fortunato et al. [62], Emukpere et al. [49], Ritter et al. [190], Wang et al. [240], and Melo [141]
  Hypernetworks: Xian et al. [250] and Beck et al. [17]
Task inference
  Multi-task pre-training: Humplik et al. [95], Kamienny et al. [104], Raileanu et al. [182], Liu et al. [130], and Peng et al. [174]
  Latent for [...]: Zhou et al. [278], Raileanu et al. [182], Zintgraf et al. [281], Zhang et al. [268], Zintgraf et al. [282], Beck et al. [17], He et al. [86], and Imagawa et al. [...97]
  Contrastive learning: Fu et al. [64]

3 Few-Shot Meta-RL

[...] in home kitchens. Training a new [...] to cook in it. However, training such an agent with meta-RL involves unique [...] few-shot setting. Recall that meta-RL itself learns a learning algorithm $f_\theta$. This places unique [...]

• Parameterized policy gradient methods build the structure of existing policy gradient [...]

[Figure: PPG method versus black box method, compared on generalization: inductive bias in structure versus inductive bias from data.]

[...] challenges. One such [...] supervision. In the standard meta-RL problem setting, rewards are available during both meta-[...] For example, it may be difficult to manually design an informative task distribution for meta-training, [...]

3.1 Parameterized Policy Gradient Methods

Meta-RL learns a learning algorithm $f_\theta$, the inner-loop. We call the parameterization of $f_\theta$ the [...] section we discuss one way of parameterizing the inner-loop that builds in the structure of existing standard RL algorithms. Parameterized policy gradients (PPG) [...]

$$\phi_{j+1} = f_\theta(\mathcal{D}_j, \phi_j) = \phi_j + \alpha_\theta \nabla_{\phi_j} \hat{J}_\theta(\mathcal{D}_j, \pi_{\phi_j}),$$

[...]

$$\phi_{j+1} = \phi_j + \alpha_\theta M_\theta \nabla_{\phi_j} \hat{J}_\theta(\mathcal{D}_j, \pi_{\phi_j})$$

[255, 170, 58]. While a value-based method could be used [...] can be updated with back-propagation in a PPG method or by a neural network in a black box [...] methods learn a full distribution over initial policy parameters, $p(\phi_0)$ [82, 260, 242, 285, 73]. This [...] fit via variational inference [82, 73]. Moreover, the distribution itself can be updated in the [...] weights and biases of the last layer of the policy [181], while leaving the rest of the parameters [...] conditioned. In this case, the input to the policy itself parameterizes a [...]

Meta-gradient estimation in outer-loop optimization. Estimating gradients for the outer-loop is [...] inner-loop. Therefore, optimizing the outer-loop requires taking the gradient of a gradient, or a meta-gradient, which involves [...] of data used by the inner-loop [...] on prior [...] by data in the outer-loop. Still, these prior policies do affect the distribution of data sampled in $\mathcal{D}$, used later by the inner-loop learning algorithm. Thus, ignoring the gradient terms in the policy [...] method may use a first-order approximation [...63], or use gradient-free optimization to optimize [...]

Outer-loop algorithms. While most PPG methods use a policy-gradient algorithm in the outer-loop [...] Additionally, one can train task-specific experts and then use these for imitation learning in the [...] exploratory behavior by optimizing Equation [...] they can [...] over PPG [...]

3.2 Black Box Methods

[...] a universal function approximator. This places fewer constraints on the function $f_\theta$ than with a [...] by structure. By conditioning a policy on a context vector, all of the weights and biases of [...] must generalize between all tasks. However, when significantly distinct policies are required for different tasks, [...] policy directly. The inner-loop may produce all of the parameters of a feed[...]

Inner-loop representation. While many black box methods use recurrent neural networks, [88, [...] attention mecha[...]

Outer-loop algorithms. While many black box methods use on-policy algorithms in the outer-loop [46, 239, 281], it is straightforward to use off-policy algorithms [185, 51, 130], which bring [...]

Black box trade-offs. One key benefit of black box methods is that they can rapidly alter their [...] often struggle to generalize outside of $p(\mathcal{M})$ [..., 252]. Consider the robot chef: while it [...] black box [...] a fully black-box method, the policy or inner-loop can be fine-tuned with policy gradients at [...]

3.3 Task Inference Methods

[...] training for each task, with no planning required. In fact, training a policy over a distribution of tasks, with access to the true task, can be taken as the definition of multi-task RL [263]. In the [...] methods map the task directly to weights [...] policy [...]

Task inference with privileged information. A straightforward method for inferring the task is to [...] $c_{\mathcal{M}}$ [...]

Task inference with multi-task training. Some research uses the multi-task setting to improve [...] that encodes the task [...] it contains only this information [95, 130]. After this, $g_\theta(c_{\mathcal{M}})$ can be inferred in meta-learning [...] task RL may be [...] is needed for the meta-RL policy to identify the task. In this case, instead of only inferring the [...] sufficiently many exploratory [...] the task [...] sharing policies becomes less feasible. Often intrinsic rewards are [...]

Task inference without privileged information. Other task inference methods do not rely on [...] For instance, a task can be [...] or transition function [278, 281, 268, 280, 86]; and task inference can use contrastive learning [...]

[Figure: illustration of free exploration in the first $K$ episodes (yellow), followed by [...] exploitation (white), over $H$ episodes.]

[...] distribution using a variational information bottleneck [...] On the other hand, training the [...]

3.4 Exploration and Meta-Exploration

[...] should work for any MDP and may consist of random on-policy exploration, epsilon-greedy ex[...] This type of exploration still occurs in the [...] additionally exists exploration in the [...]

[...] Zhou et al. [278], Gurumurthy et al. [83], Fu et al. [64], Liu et al. [130], and Zhang et al. [268] [...]

[...] $p(\mathcal{M})$. To enable sample-efficient adaptation [...] distribution. Recall that in the few-shot adaptation setting, on each trial, the agent is placed into a new task [...] on solving the task in the next few episodes (i.e., over the $H - K$ episodes in Equation 3). [...] potentially even beyond the initial few shots [...] with exploiting what it already knows to achieve high rewards. It is always optimal to explore in the first $K$ episodes, since no [...] sacrificing short-term rewards to learn a better policy for higher later returns pays dividends, while when $H - K$ is small, the agent must exploit more to obtain any reward it can, optimally [...]

End-to-end optimization. Perhaps the simplest approach is to learn to explore and exploit end-to-end by directly maximizing the meta-RL objective (Equation 3), as done by black box meta-RL approaches [46, 239, 150, 216, 26]. Approaches in this category implicitly learn to explore, as they directly optimize the meta-RL objective, whose maximization requires exploration. More specifically, the returns in the later $H - K$ episodes, $\sum_{\tau \in \mathcal{D}_{K:H}} G(\tau)$, can only be maximized if the policy appropriately explores in the first $K$ episodes, so maximizing the meta-RL objective can yield optimal exploration in principle. This approach works well when complicated exploration strategies are not needed. For example, if attempting several tasks in the distribution of tasks is a reasonable form of exploration for a particular task distribution, then end-to-end optimization may work well.

[...] ingredients (i.e., explore) if doing so results in a cooked meal. Hence, it is challenging to learn [...]

Posterior sampling. To circumvent the challenge of implicitly learning to explore, Rakelly et al. [...] what the identity of the task is, and then to iteratively refine this distribution by interacting with [...] the dynamics and reward function [...] information gain over the task distribution [64, 130], or a reduction in uncertainty of the posterior [...] first $K$ episodes, and then the exploitation policy exploits for the remaining $H - K$ [...] exploitation [...] information about the task dynamics, but are irrelevant for a robot chef trying to cook a meal. [...]

[Figure: optimal 0-shot exploration, posterior sampling, and irrelevant exploration, shown over Episodes 1 to 3. Caption fragment: [...] optimal exploration and posterior sampling. The third row [...]]

[...] considering that this intrinsic reward can be used to train a policy exclusively for off-policy data [...] For example, using random network distillation [29], a reward may add an incentive for novelty [282], or add an incentive for getting data where TD-error is high [77]. Many of these rewards [...]

3.5 Bayes-Adaptive Optimality

[...] uncertainty. Instead, optimal exploration only reduces uncertainty [...] time for exploration is limited. Therefore, [...] discuss [...] approximate Bayes-optimal policies and analyze the behavior of [...]

Bayes-adaptive Markov decision processes. To determine the optimal exploration strategy, we [...] dynamics and reward function. From a high level, the Bayes-adaptive Markov decision [...] BAMDP [...] maximizes returns when placed into an unknown MDP. Crucially, the dynamics of the [...] characterizes the current uncertainty as a distribution over potential [...] so far, and the initial belief $b_0$ is a prior $p(R, P)$. Then, the states of [...] uncertainty. Specifically, the BAMDP reward [...]

$$R^+(s_t, b_t, a_t) = \mathbb{E}_{R \sim b_t} \left[ R(s_t, a_t) \right]. \quad (4)$$

[...] the BAMDP:

$$P^+(s_{t+1}, b_{t+1} \mid s_t, b_t, a_t) = \mathbb{E}_{R, P \sim b_t} \left[ P(s_{t+1} \mid s_t, a_t) \, \delta\big(b_{t+1} = p(R, P \mid \tau_{:t+1})\big) \right]. \quad (5)$$

[...] the current belief [...] $r = R^+(s_t, b_t, a_t) = \mathbb{E}_{R \sim b_t}[R(s_t, a_t)]$. [...]

Learning an approximate Bayes-optimal policy. Directly computing Bayes-optimal policies re[...] and the latent variables $m$ can be learned by rolling out the policy to obtain [...] learn Bayes-adaptive optimal policies, the framework of BAMDPs can still offer a helpful [...]

First, black box meta-RL algorithms such as RL$^2$ learn a recurrent policy that not only conditions on the current state $s_t$, but on the history of observed states, actions, and rewards $\tau_{:t}$ [...] computing the belief state [...] meta-RL algorithms can in principle learn Bayes-adaptive [...] meta-RL algorithms struggle to learn [...] is challenging. Liu et al. [...] highlight one such optimization challenge for black box meta-RL [...] where the agent is given a few "free" episodes to explore, and the objective is to maximize the [...] returns beginning from the first timestep. These [...] result in using less suitable utensils or ingredients, though, especially when optimized at lower [...] the current task, which is equivalent to the belief state. Then [...] policy can be sufficient for optimally solving the meta-RL problem, even if it does not make use of all this state [...]

3.6 Supervision

In this section, we discuss most of the different types of supervision considered in meta-RL. In [...] expert trajectories or other privileged information during meta-training and/or testing). Each of [...]

[Table: supervision settings, including meta-RL, meta-RL with [...], and meta-RL via imitation.]
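The MAML inner-loop update described in Section 2.3, a single gradient step from the meta-learned initialization $\phi_0$ with step size $\alpha$, can be illustrated on a toy problem. The following is a minimal sketch only: the multi-armed bandit task, the softmax policy, the use of an exact gradient in place of the sampled estimate $\hat{J}$, and all function names are assumptions made for illustration and do not come from the survey.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(phi, mean_rewards):
    """J(pi_phi) for a one-step bandit: sum over arms of pi(a) * r_a."""
    pi = softmax(phi)
    return sum(p * r for p, r in zip(pi, mean_rewards))


def return_gradient(phi, mean_rewards):
    """Exact gradient of J w.r.t. the logits: dJ/dphi_a = pi(a) * (r_a - J).

    Assumption: a real meta-RL method would estimate this from sampled data.
    """
    pi = softmax(phi)
    j = sum(p * r for p, r in zip(pi, mean_rewards))
    return [p * (r - j) for p, r in zip(pi, mean_rewards)]


def inner_loop_step(phi0, mean_rewards, alpha):
    """One inner-loop update: phi = phi0 + alpha * grad_{phi0} J."""
    grad = return_gradient(phi0, mean_rewards)
    return [p + alpha * g for p, g in zip(phi0, grad)]


# Hypothetical task: three arms with these mean rewards.
task = [0.1, 0.9, 0.3]
phi0 = [0.0, 0.0, 0.0]  # stand-in for a meta-learned initialization

phi1 = inner_loop_step(phi0, task, alpha=1.0)
before = expected_return(phi0, task)
after = expected_return(phi1, task)
print(before, after)
```

In this sketch the adapted parameters `phi1` shift probability toward the best arm, so `after` exceeds `before`. The outer loop, which would adjust `phi0` so that this single step works well across tasks drawn from $p(\mathcal{M})$, is deliberately left out.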
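The BAMDP reward in Equation 4, the expectation of the reward under the current belief, and the posterior belief that appears inside Equation 5 can be made concrete when the belief ranges over a finite set of candidate tasks. The sketch below is a hedged illustration: the finite candidate set, the Bernoulli rewards, the single-state bandit setting, and the helper names are all assumptions for the example rather than the survey's construction.

```python
def expected_reward(belief, candidate_means, action):
    """R+(b, a) = E_{R ~ b}[R(a)] for a belief over finitely many candidate tasks."""
    return sum(b * means[action] for b, means in zip(belief, candidate_means))


def update_belief(belief, candidate_means, action, reward):
    """Bayes rule: b'(task) is proportional to b(task) * p(reward | task, action).

    Assumption: rewards are Bernoulli, so p(1 | task, a) is the task's mean for a.
    """
    likelihoods = [
        means[action] if reward == 1 else 1.0 - means[action]
        for means in candidate_means
    ]
    unnormalized = [b * lik for b, lik in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return [u / total for u in unnormalized]


# Two hypothetical candidate tasks, each giving the success probability per arm.
candidates = [[0.9, 0.1], [0.1, 0.9]]
b0 = [0.5, 0.5]  # uniform prior over the two candidates

r_plus = expected_reward(b0, candidates, action=0)
b1 = update_belief(b0, candidates, action=0, reward=1)
print(r_plus, b1)
```

Observing a reward of 1 on arm 0 moves the belief toward the first candidate task, which is the sense in which exploration reduces task uncertainty. A full BAMDP would also track the state transition; it is omitted here because the example has a single state.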