arXiv [cs.LG], 19 Jan 2023

A Survey of Meta-Reinforcement Learning

Jacob Beck* (jacob.beck@cs.ox.ac.uk), University of Oxford; Risto Vuorio* (risto.vuorio@cs.ox.ac.uk), University of Oxford; Evan Zheran Liu, Stanford University; Zheng Xiong (zheng.xiong@cs.ox.ac.uk), University of Oxford; Luisa Zintgraf†, University of Oxford; Chelsea Finn, Stanford University; Shimon Whiteson (shimon.whiteson@cs.ox.ac.uk), University of Oxford

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile successes, [...] it is held back from more widespread adoption by [...]. In this survey, we [...]

1 Introduction

[Meta-re]inforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL [algorithms, or components of them]. [... Meta-RL has been investiga]ted as a machine learning problem for a significant period of time. Intriguingly, [...]

2 Background

2.1 Reinforcement learning

[The agent interacts with a Markov decision process (MDP), referred] to as the agent's environment. An MDP is defined by a tuple M = ⟨S, A, P, P_0, R, γ, T⟩, where S is the set of states, A the set of actions, P(s_{t+1} | s_t, a_t): S × A × S → R⁺ the probability of transitioning from state s_t to state s_{t+1} after taking action a_t, and P_0(s_0): S → R⁺ a distribution [over initial states, with R the reward function, γ the discount factor, and T the horizon]. A policy is a function π(a | s): S × A → R⁺ that maps states to action probabilities. This way, [the policy and the MDP together induce a distribution over trajectories τ]:

    P(τ) = P_0(s_0) ∏_{t=0}^{T−1} π(a_t | s_t) P(s_{t+1} | s_t, a_t).    (1)

[The RL objective is the expected discounted return,]

    J(π) = E_{τ∼P(τ)} [ Σ_t γ^t r_t ].    (2)

[...] multiple episodes are gathered. If H episodes have been gathered, then D = {τ_h}_{h=0}^{H−1} is all of the data [collected so far. We] define an RL algorithm as the function f(D): ((S × A × R)^T)^H → Φ. In practice, the data may include [...]

2.2 Meta-RL definition

[The idea of meta-RL] is instead to learn (parts of) an algorithm f using machine learning. Where RL learns a policy, [meta-RL learns the learning algorithm itself, relieving] the human from directly designing and implementing the RL algorithms. [... The meta-parameters θ are trained] to maximize a meta-RL objective. Hence, f_θ outputs the parameters of π_φ directly: φ = f_θ(D). We refer to the policy π_φ as the base policy with base parameters φ. Here, D is a meta-trajectory [...]. [Accor]dingly, we may call θ the outer-loop parameters and φ the inner-loop parameters. [... In principle, this setting can b]e supported by any set of tasks. However, [it is common for] S and A to be shared between all of the tasks and for the tasks to only [differ in their reward and transition functions].

[Figure: schematic of the meta-RL inner loop and outer loop.]

[The meta-RL objective is]

    J(θ) = E_{M_i∼p(M)} [ E_D [ Σ_{τ∈D_{K:H}} G(τ) | f_θ, M_i ] ],    (3)

where G(τ) is the discounted return in the MDP M_i and H is the length of the trial, or the task-[horizon. The first K episodes of each trial are excluded from the objective, and D is the data generated by the inn]er loop f_θ(D).

2.3 Example algorithms

[Two influential meta-RL algorithms are Model-Agnostic] Meta-Learning (MAML), which uses meta-gradients, and Fast RL via Slow RL (RL²), which uses recurrent neural networks [46, 239]. Many meta-RL algorithms build on [designs] similar to those used in MAML and RL², which makes them excellent [introductory examples].

MAML. Many designs of the inner-loop algorithm f_θ build on existing RL algorithms and use meta-learning to improve them. MAML [55] is an influential design following this pattern. Its [inner loop is a policy-gradient algorithm whose initial parameters φ_0 = θ are meta-learned] with gradient descent to be a good starting point for learning on tasks from the task distribution. When adapting to a new task, MAML collects data [with the initial policy and takes a policy-gradient] step for a task M_i ∼ p(M):

    φ_i^1 = f_θ(D, φ_0) = φ_0 + α ∇_{φ_0} Ĵ(D, π_{φ_0}),

where Ĵ(D, π_{φ_0}) is an estimate of the returns of the policy π_{φ_0} for the task M_i and α is the [inner-loop learning rate. In the outer loop, the initialization is updated by gradient ascent on the post-adaptation returns:]

    φ_0 ← φ_0 + β ∇_{φ_0} E_{M_i∼p(M)} [ Ĵ(D_i^1, π_{φ_i^1}) ],

where π_{φ_i^1} is the policy for task i updated once by the inner loop, β is a learning rate, and [D_i^1 is data collected with the adapted policy. Computing this update requires differentiating through the inner-loop gradient step, i.e., a meta-gradient.]

RL². In contrast, RL² [46, 239] represents the inner loop itself as a recurrent neural network whose weights are the meta-parameters θ: the recurrent base policy conditions on the history of states, actions, and rewards, and adaptation happens in its hidden state rather than through explicit parameter updates. [...] Its objective matches Equation 3 in general, with K [...], up to differences in the discounting. To optimize [this objective, the outer loop trains] the RNN [end-to-end with a standard RL algorithm]. However, MAML cannot trivially [...]
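To make the two-level structure of Section 2.3 concrete, below is a minimal MAML-style sketch. It is a toy illustration rather than the authors' implementation: it assumes a task family in which the return Ĵ is a closed-form differentiable function of the policy parameters (a one-step "reach the goal" problem), so the meta-gradient through the inner update can be taken directly with automatic differentiation. The helper names sample_task and expected_return are invented for this sketch.

```python
# Minimal MAML-style inner/outer loop on a toy differentiable task family.
# Assumption: each task's return is a closed-form differentiable function of the
# policy parameters, so no sampled policy-gradient estimator is needed here.
import torch

def sample_task():
    # A task M_i is a random goal; the return is the negative squared distance to it.
    return torch.randn(2)

def expected_return(phi, goal):
    # Deterministic "policy" that outputs the action phi directly; higher is better.
    return -((phi - goal) ** 2).sum()

phi0 = torch.zeros(2, requires_grad=True)   # meta-learned initialization (theta = phi_0)
alpha, beta = 0.1, 0.01                     # inner- and outer-loop learning rates
outer_opt = torch.optim.SGD([phi0], lr=beta)

for meta_iter in range(1000):
    outer_opt.zero_grad()
    meta_objective = 0.0
    for _ in range(8):                      # batch of tasks M_i ~ p(M)
        goal = sample_task()
        # Inner loop: one gradient-ascent step from phi0 on task i.
        inner_grad = torch.autograd.grad(expected_return(phi0, goal), phi0,
                                         create_graph=True)[0]
        phi_i = phi0 + alpha * inner_grad
        # Outer objective: post-adaptation return; gradients flow through phi_i.
        meta_objective = meta_objective + expected_return(phi_i, goal)
    (-meta_objective / 8).backward()        # ascend the meta-objective
    outer_opt.step()
```

In a real meta-RL setting, expected_return would be replaced by a sampled policy-gradient surrogate, which is exactly what makes outer-loop gradient estimation nontrivial (see Section 3.1).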
2.4 Problem Categories

While the given problem setting applies to all of meta-RL, distinct clusters in the literature have [emerged. The most common is the few-shot m]ulti-task setting. In this setting, an agent must quickly [adapt to a new task sampled from the task distribution within a handful of episodes]. [A second cluster considers many-shot adaptation, either across many tasks or within a single task encountered] during training. Methods for this many-shot single-task setting tend to [...]

[Table: example methods by problem setting. Multi-task few-shot: RL² [46, 239], MAML [55]; multi-task many-shot: LPG, MetaGenRL.]

[Figure: overview of meta-RL problem categories. Few-shot meta-RL meta-learns over multiple (similar) tasks; zero-shot methods (RL², L2RL, VariBAD) must perform well from the start of a new task, while few-shot methods (MAML, DREAM) get a free exploration phase and learn new tasks within a few steps or episodes. Many-shot meta-RL either meta-learns over multiple (diverse) tasks, with the goal of learning new tasks better than standard RL algorithms (LPG, MetaGenRL), or meta-learns over windows of a single task without resets, with the goal of accelerating standard RL algorithms (STACX, FRODO).]

3 Few-Shot Meta-RL

[Consider a robot chef that must learn to coo]k in home kitchens. Training a new [policy from scratch for every kitchen would be prohibitively expensive,] were it [even feasible, each time the robot is deployed] to cook in it. However, training such an agent with meta-RL involves unique [challenges specific to this] few-shot setting. Recall that meta-RL itself learns a learning algorithm f_θ. This places unique [demands on how the inner loop is parameterized:]

• Parameterized policy gradient methods build the structure of existing policy gradient [algorithms into the inner loop.]
• [Black box methods place few constraints on the inner loop, representing it with a general-purpose function approximator.]
• [Task inference methods structure the inner loop around identifying the task.]

[Table: few-shot meta-RL methods by category. Parameterized policy gradients — MAML-like: Finn et al. [55], Li et al. [124], Sung et al. [219], Vuorio et al. [235], and Zintgraf et al. [...]; distributional MAML: [...]; meta-gradient estimation: Foerster et al. [60], Al-Shedivat et al. [207], Stadie et al. [216], Liu et al. [133], Mao et al. [139], Fallah et al. [52], Tang [222], and Vuorio et al. [234]. Black box — recurrent inner loop: Heess et al. [88], Duan et al. [46], Wang et al. [239], Humplik et al. [95], Fakoor et al. [51], Yan et al. [256], Zintgraf et al. [281], Liu et al. [130], and Zintgraf et al. [282]; attention: Mishra et al. [150], Fortunato et al. [62], Emukpere et al. [49], Ritter et al. [190], Wang et al. [240], and Melo [141]; hypernetworks: Xian et al. [250] and Beck et al. [17]. Task inference — multi-task pre-training: Humplik et al. [95], Kamienny et al. [104], Raileanu et al. [182], Liu et al. [130], and Peng et al. [174]; latent variables: Zhou et al. [278], Raileanu et al. [182], Zintgraf et al. [281], Zhang et al. [268], Zintgraf et al. [282], Beck et al. [17], He et al. [86], and Imagawa et al. [97]; contrastive learning: Fu et al. [64].]

[Figure: parameterized policy gradient (PPG) methods vs. black box methods. PPG methods (e.g., MAML) place their inductive bias in the structure of the inner loop, whereas black box methods obtain their inductive bias from data; this difference affects generalization.]

[These parameterizations come with distinct] challenges. One such [challenge conce]rns supervision. In the standard meta-RL problem setting, rewards are available during both meta-[training and meta-testing. ... For ex]ample, it may be difficult to manually design an informative task distribution for meta-training. [...]

3.1 Parameterized Policy Gradient Methods

Meta-RL learns a learning algorithm f_θ, the inner loop. We call the parameterization of f_θ the [inner-loop structure. In this] section we discuss one way of parameterizing the inner loop that builds in the structure of existing standard RL algorithms. Parameterized policy gradient (PPG) [methods take the general form]

    φ_{j+1} = f_θ(D_j, φ_j) = φ_j + α_θ ∇_{φ_j} Ĵ_θ(D_j, π_{φ_j}),

[where the meta-parameters θ may parameterize, for example, the initialization φ_0, the learning rate α_θ, or the return estimate Ĵ_θ; wha]tever [is not meta-learned is fixed by the designer. More generally, θ may parameterize a preconditioning matrix M_θ:]

    φ_{j+1} = φ_j + α_θ M_θ ∇_{φ_j} Ĵ_θ(D_j, π_{φ_j})    [255, 170, 58].

While a value-based method could be used [in the inner loop, ... the initialization] can be updated with back-propagation in a PPG method or by a neural network in a black box [method. Some PPG meth]ods learn a full distribution over initial policy parameters, p(φ_0) [82, 260, 242, 285, 73]. This [distribution over parame]ters [can be] fit via variational inference [82, 73]. Moreover, the distribution itself can be updated in the [inner loop. Other methods adapt onl]y the weights and biases of the last layer of the policy [181], while leaving the rest of the parameters [fixed; the remaining layers may then be conditioned on a context v]ector. In this case, the input to the policy itself parameterizes [the adaptation].

Meta-gradient estimation in outer-loop optimization. Estimating gradients for the outer loop is [complicated by the dependence of the objective on the i]nner loop. Therefore, optimizing the outer loop requires taking the gradient of a gradient, or a meta-gradient, which involves [second-order derivatives. Moreover, the distribution] of data used by [the] inner loop [depends] on prior [policies, which is typically not reflec]ted [in the gradient esti]mated [from] data in the outer loop. Still, these prior policies do affect the distribution of data sampled in D, used later by the inner-loop learning algorithm. Thus, ignoring the gradient terms in the policy [that collected the data biases the meta-gradi]ent. [Alternatively, a] method may use a first-order approximation [63], or use gradient-free optimization to opti[mize the outer loop].

Outer-loop algorithms. While most PPG methods use a policy-gradient algorithm in the outer [loop, ...]. Additionally, one can train task-specific experts and then use these for imitation learning in the [outer loop, matching expert trajec]tory behavior by optimizing Equation [...]; [in such cases] they can [offer an advantag]e over PPG [methods].

3.2 Black Box Methods

[In black box methods, the inner loop f_θ is represented by] a universal function approximator. This places fewer constraints on the function f_θ than with a [PPG method, whose inner loop is constrain]ed by structure. By conditioning a policy on a context vector, all of the weights and biases of [the base policy can be shared across tasks, but they] must generalize between all tasks. However, when significantly distinct policies are required for different tasks, [it can be preferable for the inner loop to produce the poli]cy directly: the inner loop may produce all of the parameters of a feed[-forward base policy].

Inner-loop representation. While many black box methods use recurrent neural networks [88, ...], [the inner loop can also be represented by other architectures, such as attenti]on mecha[nisms ...] (see the recurrent sketch below).
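As a concrete illustration of a recurrent black-box inner loop (Sections 2.3 and 3.2), the sketch below shows an RL²-style policy whose GRU hidden state carries all adaptation: it conditions on the previous action, reward, and termination flag in addition to the current observation, and its hidden state is reset between trials but not between the episodes within a trial. The architecture and dimensions are illustrative assumptions, not a reference implementation from the survey.

```python
# Minimal sketch of a black-box (RL^2-style) recurrent policy. Adaptation lives in
# the GRU hidden state; the weights are the meta-parameters theta trained by the
# outer loop. Sizes are arbitrary choices for illustration.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        # Input per step: observation, one-hot previous action, previous reward, done flag.
        self.gru = nn.GRU(obs_dim + num_actions + 2, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, num_actions)
        self.num_actions = num_actions

    def forward(self, obs, prev_action, prev_reward, done, h):
        x = torch.cat([obs,
                       nn.functional.one_hot(prev_action, self.num_actions).float(),
                       prev_reward.unsqueeze(-1),
                       done.unsqueeze(-1)], dim=-1)
        out, h = self.gru(x.unsqueeze(1), h)   # process a single timestep
        return torch.distributions.Categorical(logits=self.pi(out[:, 0])), h

policy = RecurrentPolicy(obs_dim=4, num_actions=3)
h = torch.zeros(1, 1, 128)                     # reset only at the start of a new trial/task
dist, h = policy(torch.zeros(1, 4), torch.zeros(1, dtype=torch.long),
                 torch.zeros(1), torch.zeros(1), h)
action = dist.sample()                         # h carries what has been learned so far
```

Training the weights of such a policy with an ordinary policy-gradient method in the outer loop, while carrying the hidden state across the episodes of a trial, is what lets information gathered in early episodes be exploited in later ones.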
Outer-loop algorithms. While many black box methods use on-policy algorithms in the outer loop [46, 239, 281], it is straightforward to use off-policy algorithms [185, 51, 130], which bring [improved sample efficiency].

Black box trade-offs. One key benefit of black box methods is that they can rapidly alter their [policy in response to new data. However, they] often struggle to generalize outside of p(M) [..., 252]. Consider the robot chef: while it [...]. [After meta-train]ing a fully black-box method, the policy or inner loop can be fine-tuned with policy gradients at [meta-test time].

3.3 Task Inference Methods

[...] training for each task, with no planning required. In fact, training a policy over a distribution of tasks, with access to the true task, can be taken as the definition of multi-task RL [263]. In the [multi-task setting, some metho]ds map the task directly to the weights [of a] policy. [Task inference methods aim to obtain these benefits in meta-RL by inferring t]he [task from data].

Task inference with privileged information. A straightforward method for inferring the task is to [train an inference network to predict the tas]k [description] c_M, [which is available as privileged information during meta-training]. [...]

Task inference with multi-task training. Some research uses the multi-task setting to improve [task inference. ... a representa]tion [is learned] that encodes the task [such that] it contains only this information [95, 130]. After this, g_θ(c_M) can be inferred in meta-learning. [... Multi-]task RL may be [simpler to optimize, but exploration] is needed for the meta-RL policy to identify the task. In this case, instead of only inferring the [task, the agent must also take suf]ficiently many exploratory [actions. When little is shared between] the task[s,] sharing policies becomes less feasible. Often intrinsic rewards are [used ...].

Task inference without privileged information. Other task inference methods do not rely on [privileged information]. For instance, a task can be [represented by a latent variable trained to predict the reward] or transition function [278, 281, 268, 280, 86], and task inference can use contrastive learning [...].

[Figure: illustration of a trial of H episodes with free exploration in the first K episodes (yellow) followed by exploitation (white), compared with exploration that is not free, followed by exploitation (white).]

[A task latent can also be regularized toward a prior] distribution using a variational information bottleneck [...]. On the other hand, training the [...]

3.4 Exploration and Meta-Exploration

[Exploration in standard RL] should work for any MDP and may consist of random on-policy exploration, epsilon-greedy ex[ploration, or similar strategies. Th]is type of exploration still occurs in the [outer loop of meta-RL. In the inner loop, there] additionally exists exploration in the [...]

[Table: exploration methods, including Zhou et al. [278], Gurumurthy et al. [83], Fu et al. [64], Liu et al. [130], and Zhang et al. [268].]

[This inner-loop exploration can be tailored to the task distribution] p(M). To enable sample-efficient adaptation, [knowledge about the task] distribution [can be] used. Recall that in the few-shot adaptation setting, on each trial the agent is placed into a new task [and is evaluated] on solving the task in the next few episodes (i.e., over the H−K episodes in Equation 3). [D]uring [the trial, and po]tentially even beyond the initial few shots, [the agent must balance gathering information] with exploiting what it already knows to achieve high rewards. It is always optimal to explore in the first K episodes, since no [reward from those episodes counts toward the objective. When H−K is large, sacrif]icing short-term rewards to learn a better policy for higher later returns pays dividends, while when H−K is small, the agent must exploit more to obtain any reward it can, optimally [balancing the two].

End-to-end optimization. Perhaps the simplest approach is to learn to explore and exploit end-to-end by directly maximizing the meta-RL objective (Equation 3), as done by black box meta-RL approaches [46, 239, 150, 216, 26]. Approaches in this category implicitly learn to explore, as they directly optimize the meta-RL objective, whose maximization requires exploration. More specifically, the returns Σ_{τ∈D_{K:H}} G(τ) in the later episodes can only be maximized if the policy appropriately explores in the first K episodes, so maximizing the meta-RL objective can yield optimal exploration in principle. This approach works well when complicated exploration strategies are not needed. For example, if attempting several tasks in the distribution of tasks is a reasonable form of exploration for a particular task distribution, then end-to-end optimization may work well. [However, the robot chef only gets credit for finding the] ingredients (i.e., exploring) if doing so results in a cooked meal. Hence, it is challenging to learn [to explore end-to-end when the reward for exploration is this indirect].

Posterior sampling. To circumvent the challenge of implicitly learning to explore, Rakelly et al. [propose to maintain a distribution over] what the identity of the task is, and then to iteratively refine this distribution by interacting with [the environment]: via [posterior sampling, a candidate task is drawn from the current distribution, the agent acts as if that candidate were the true task, and the data gathered refines the distribution, narrowi]ng [down] the dynamics and reward function [consistent with what has been observed]. (A minimal sketch of posterior sampling over a finite set of candidate tasks follows below.)

[Exploration can also be driven by intrinsic rewards, such as] information gain over the task distribution [64, 130], or a reduction in uncertainty of the poste[rior. Some methods train a separate exploration policy for the] first K episodes, and then the exploitation policy exploits for the remaining H−K [episodes]. [Not all information is useful, however: some behaviors reveal] information about the task dynamics but are irrelevant for a robot chef trying to cook a meal.
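To make the posterior-sampling strategy above concrete, here is a minimal sketch in which the task is one of a small set of known candidates, modeled as a toy Bernoulli-bandit family invented for this illustration (it is not the setting used by the cited methods). Each episode, the agent samples a task hypothesis from its current belief, acts optimally for that hypothesis, and performs a Bayes update on the observed reward.

```python
# Posterior sampling over a finite set of candidate tasks (toy Bernoulli bandit).
# Task i makes arm i pay off with probability 0.9 and every other arm with 0.1.
import numpy as np

rng = np.random.default_rng(0)
num_tasks = 5                                   # candidate tasks; arm i is good in task i
true_task = rng.integers(num_tasks)             # unknown to the agent
p_good, p_bad = 0.9, 0.1

posterior = np.full(num_tasks, 1.0 / num_tasks)    # belief over the task identity

for episode in range(20):
    hypothesis = rng.choice(num_tasks, p=posterior)   # sample a task from the posterior
    action = hypothesis                               # act optimally *for the sampled task*
    reward = rng.random() < (p_good if action == true_task else p_bad)

    # Bayes update: likelihood of the observed reward under each candidate task.
    likelihood = np.where(np.arange(num_tasks) == action,
                          p_good if reward else 1 - p_good,
                          p_bad if reward else 1 - p_bad)
    posterior = posterior * likelihood
    posterior /= posterior.sum()

print("posterior over tasks:", np.round(posterior, 3), "| true task:", true_task)
```

Because every episode is spent acting optimally for some still-plausible task, exploration is directed at distinguishing the remaining candidates rather than at behavior that is irrelevant to solving any of them.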
[Figure: comparison of optimal exploration, zero-shot posterior sampling, and irrelevant exploration over episodes 1–3 of a trial; the third row [...].]

[...] considering that this intrinsic reward can be used to train a policy exclusively [from] off-policy data [...]. For example, using random network distillation [29], a reward may add an incentive for novelty [282], or add an incentive for getting data where the TD-error is high [77]. Many of these rewards [...]

3.5 Bayes-Adaptive Optimality

[Optimal exploration does not reduce unce]rtainty [for its own sake]. Instead, optimal exploration only reduces uncertainty [insofar as this increases expected return, particularly when the ti]me for exploration is limited. Therefore, [...]. [In this section we dis]cuss [appro]ximate Bayes-optimal policies and analyze the behavior of Bayes-adaptive [methods through the lens of Bayes-adaptive] Markov decision processes. To determine the optimal exploration strategy, we [must account for the agent's uncertainty about the dynam]ics and reward function. From a high level, the Bayes-adaptive Markov deci[sion process (]BAMDP[) describes how an agent] maximizes returns when placed into an unknown MDP. Crucially, the dynamics of the [BAMDP maintain a belief state b_t that cha]racterizes the current uncertainty as a distribution over potential [reward and transition func]tions. [The belief is conditioned on the interactions τ_{:t} = (s_0, a_0, r_0, ..., ]s_t) [observed] so far, and the initial belief b_0 is a prior p(R, P). Then, the states of [the BAMDP are hyper-states (s_t, b_t) that augment the environment state with this uncertai]nty. Specifically, the BAMDP reward [is the expected reward under the current belief,]

    R⁺(s_t, b_t, a_t) = E_{R∼b_t}[R(s_t, a_t)],    (4)

[and likewise for the transition dynamics of t]he BAMDP:

    P⁺(s_{t+1}, b_{t+1} | s_t, b_t, a_t) = E_{R,P∼b_t}[ P(s_{t+1} | s_t, a_t) δ(b_{t+1} = p(R, P | τ_{:t+1})) ].    (5)

(A toy numerical sketch of these two quantities is given after Section 3.6 below.)

Learning an approximate Bayes-optimal policy. Directly computing Bayes-optimal policies [is intractable for all but the smallest MDPs]. [... the belief distributi]on and the latent variables m can be learned by rolling out the policy to obtain [data ...]. [While it is generally intractable to exactly lea]rn Bayes-adaptive optimal policies, the framework of BAMDPs can still offer a helpful [perspective]. First, black box meta-RL algorithms such as RL² learn a recurrent policy that not only conditions on the current state s_t, but on the history of observed states, actions, and rewards τ_{:t} = [(s_0, a_0, r_0, ..., s_t). The hidden state of such a policy can be viewed as co]mputing the belief state. [Hence, black box m]eta-RL algorithms can in principle learn Bayes-adaptive [policies. In practice, however, such] meta-RL algorithms struggle to learn [them because the optimization] is challenging. Liu et al. highlight one such optimization challenge for black box meta-RL, where the agent is given a few "free" episodes to explore and the objective is to maximize the [retur]ns beginning from the first time step. These [...] result in using less suitable utensils or ingredients, though, especially when optimized at lower [...]. [... a policy may only need to in]fer [as much about] the current task [as is needed to act], which is equivalent to the belief state. Then [... such a poli]cy can be sufficient for optimally solving the meta-RL problem, even if it does not make use of all of this state.

3.6 Supervision

In this section, we discuss most of the different types of supervision considered in meta-RL. In [several of these settings, the agent has access to] expert trajectories or other privileged information during meta-training and/or testing. Each of [...]

[Table: types of supervision, including standard meta-RL, meta-RL with [...], and meta-RL via imitation.]
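As referenced in Section 3.5, the toy sketch below illustrates the BAMDP quantities of Equations 4 and 5 for a deliberately simple case: two candidate reward functions over a two-state MDP with known dynamics, a discrete belief over which candidate is active, and an assumed Gaussian observation-noise model for the Bayes update. All names and numbers are invented for illustration and are not taken from the survey.

```python
# Toy illustration of the BAMDP reward (Eq. 4) and belief/hyper-state transition (Eq. 5):
# the belief is a discrete distribution over two candidate reward functions, and the
# transition function P is assumed known. A Gaussian reward-noise model is assumed
# purely so that the Bayes update has a concrete likelihood.
import numpy as np

n_states = 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],              # P[s, a, s']: known dynamics
              [[0.5, 0.5], [0.1, 0.9]]])
R_candidates = [np.array([[1.0, 0.0], [0.0, 0.0]]),  # candidate reward functions R[s, a]
                np.array([[0.0, 1.0], [0.0, 1.0]])]
belief = np.array([0.5, 0.5])                        # b_0: uniform prior over the candidates

def bamdp_reward(s, b, a):
    # Equation 4: R+(s_t, b_t, a_t) = E_{R ~ b_t}[R(s_t, a_t)]
    return sum(b_i * R[s, a] for b_i, R in zip(b, R_candidates))

def belief_update(b, s, a, r, noise=0.1):
    # The delta term in Equation 5: b_{t+1} is the Bayesian posterior given the new
    # observation; here only R is uncertain, observed through Gaussian noise.
    likelihood = np.array([np.exp(-0.5 * ((r - R[s, a]) / noise) ** 2)
                           for R in R_candidates])
    b_new = b * likelihood
    return b_new / b_new.sum()

rng = np.random.default_rng(0)
s, a = 0, 0
print("R+ under the prior:", bamdp_reward(s, belief, a))       # 0.5 under the uniform belief
r_observed = 1.0                                               # as if candidate 0 were true
belief = belief_update(belief, s, a, r_observed)
s_next = rng.choice(n_states, p=P[s, a])                       # hyper-state becomes (s', b')
print("updated belief:", np.round(belief, 3))
print("R+ under the updated belief:", round(bamdp_reward(s, belief, a), 3))
```

The expected reward R⁺ changes as the belief sharpens, which is what makes information-gathering actions valuable in the BAMDP and why a Bayes-optimal policy explores only to the extent that doing so raises expected return.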