Building Agentic Systems in an Era of Large Language Models

Charles Packer

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2024-223

/Pubs/TechRpts/2024/EECS-2024-223.html

December 19, 2024

Copyright © 2024, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Fall 2024

Building Agentic Systems in an Era of Large Language Models

By

Charles Packer

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Joseph E. Gonzalez, Chair
Professor Ion Stoica
Professor Matei Zaharia
Doctor Yuandong Tian

Building Agentic Systems in an Era of Large Language Models

Copyright 2024 by

Charles Packer


Abstract

Building Agentic Systems in an Era of Large Language Models

by

Charles Packer

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Joseph E. Gonzalez, Chair

Building intelligent autonomous systems that can reason, adapt, and interact with their environment has been a long-standing goal in artificial intelligence. This thesis explores the evolution of agentic systems through the deep learning revolution, from reinforcement learning to modern Large Language Models (LLMs), focusing on the critical components needed to create reliable autonomous agents.

First, we address the fundamental challenge of generalization in deep reinforcement learning (RL), introducing a systematic framework for evaluating and improving how learned policies transfer across environments. Building on this foundation, we present Hindsight Task Relabeling (HTR), a novel approach that enables meta-RL algorithms to learn adaptation strategies in sparse reward settings without requiring dense reward signals during training.

Finally, we address the emerging challenges of building reliable agents using Large Language Models. While LLMs demonstrate unprecedented reasoning capabilities, their effectiveness as autonomous agents is limited by fundamental constraints in their architecture: most notably, their stateless nature and fixed context windows. We present MemGPT, an operating system-inspired framework that enables LLMs to manage their own memory and state, introducing concepts like virtual context management and self-directed memory operations. MemGPT demonstrates that by treating LLMs as a new fundamental unit of compute, analogous to how CPUs were the fundamental unit in traditional operating systems, we can build more reliable and capable autonomous agents.

Together, these systems trace the evolution of agentic AI systems and provide key building blocks for creating more reliable and capable autonomous agents. By addressing core challenges in generalization, adaptation, and memory management, this thesis establishes a foundation for engineering the next generation of AI systems that can effectively reason and interact with the world.

To my parents

Contents

List of Figures   v

List of Tables   ix

Acknowledgments   x

1 Introduction   1
1.1 Background   1
1.1.1 The Deep Learning Revolution in Robotics and Control   1
1.1.2 The Rise of Foundation Models   2
1.2 Deep Learning for Agentic Systems   2
1.3 The LLM Agent Paradigm   3

2 Assessing Generalization in Deep Reinforcement Learning   4
2.1 Introduction   4
2.2 Background   6
2.3 Notation   7
2.4 Algorithms   8
2.5 Environments   9
2.6 Experimental setup   11
2.7 Experimental setup   12
2.8 Results and discussion   14
2.9 Conclusion   15
2.10 Additional details   16
2.10.1 Environment Details   16
2.10.2 Training Hyperparameters   16
2.10.3 Detailed Experimental Results   18
2.10.4 Behavior of MountainCar   18
2.10.5 Training Curves   21
2.10.6 Videos of trained agents   21

3 Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL   26
3.1 Introduction   26
3.2 Related work   27
3.3 Background   28
3.3.1 Meta-Reinforcement Learning (Meta-RL)   29
3.3.2 Off-Policy Meta-Reinforcement Learning   29
3.3.3 Hindsight Experience Replay   30
3.4 Leveraging Hindsight in Meta-Reinforcement Learning   31
3.4.1 Algorithm Design   32
3.4.2 Single Episode Relabeling (SER) strategy   33
3.4.3 Episode Clustering (EC) strategy   33
3.4.4 Comparison of HTR and HER   34
3.4.5 Limitations   34
3.5 Experiments   35
3.5.1 Environments   35
3.5.2 HTR enables meta-training using only sparse reward   36
3.5.3 Varying key hyperparameters   38
3.6 Conclusion   39
3.7 Experimental Setup (additional details)   40
3.7.1 Computing Infrastructure   40
3.7.2 Hyperparameters   40
3.7.3 Reward Functions   40
3.7.4 Changing the Distance to Goal   41
3.8 Algorithm Specifics   41
3.8.1 Sample-Time vs Data Generation Relabelling   41
3.8.2 Single Episode Relabelling Implementation Details   41
3.8.3 Episode Clustering Implementation Details   42
3.8.4 Time and Space Complexity   43

4 MemGPT: Towards LLMs as Operating Systems   44
4.1 Introduction   44
4.2 MemGPT (MemoryGPT)   46
4.2.1 Main context (prompt tokens)   46
4.2.2 Queue Manager   47
4.2.3 Function executor (handling of completion tokens)   47
4.2.4 Control flow and function chaining   48
4.3 Experiments   49
4.4 Experiments   49
4.4.1 MemGPT for conversational agents   50
4.4.2 MemGPT for document analysis   52
4.5 Related work   55
4.6 Conclusion   56
4.7 Additional details   56
4.7.1 Limitations   56
4.7.2 MemGPT pseudocode   57
4.7.3 MemGPT function set   58
4.7.4 Prompts and instructions   61
4.7.5 Balancing Working Context and the FIFO Queue   67

5 From Serving Models to Serving Agents: The Missing Pieces for Supporting Agentic Workloads   69
5.1 Introduction   69
5.1.1 The Existing Stateless LLM Programming Model   69
5.1.2 Agentic Programming Model   70
5.1.3 Agent State   70
5.2 The Agent Hosting Layer   70
5.2.1 LLM Inference: Co-optimization with the inference layer   71
5.2.2 State & Context Management   71
5.2.3 Multi-agent communication and orchestration   71

6 Conclusion & Future Work   72

Bibliography   74

List of Figures

2.1 Schematic of the three versions of an environment   17

2.2 MountainCar: heatmap of the rewards achieved by A2C with the FF architecture on DR and DE. The axes are the two environment parameters varied in R and E.   22

2.3 Pendulum: heatmap of the rewards achieved by A2C with the FF architecture on DR and DE. The axes are the two environment parameters varied in R and E.   23

2.4 PPO with FF architecture   24

2.5 PPO with RC architecture   24

2.6 EPOpt-PPO with FF architecture   24

2.7 EPOpt-PPO with RC architecture   24

2.8 RL2-PPO   24

2.9 Training curves for the PPO-based algorithms on CartPole, all three environment versions. Note that the decrease in mean episode reward at 10000 episodes in the two EPOpt-PPO plots is due to the fact that it transitions from being computed using all generated episodes (ε = 1) to only the 10% with lowest reward (ε = 0.1).   24

2.10 Video frames of agents trained with A2C on HalfCheetah, trained in the Deterministic (D), Random (R), and Extreme (E) settings (from top to bottom). All agents evaluated in the D setting.   25

2.11 Video frames of agents trained with PPO on HalfCheetah, trained in the Deterministic (D), Random (R), and Extreme (E) settings (from top to bottom). All agents evaluated in the D setting.   25

3.1 In goal-conditioned RL (a), an agent must navigate to a provided goal location g (filled circle, revealed to the agent). An unsuccessful attempt for goal g provides no sparse reward signal, but can be relabelled as a successful attempt for goal g′, creating sparse reward that can be used to train the agent. In meta-RL (b), the task T (i.e., goal, hollow circle) is never revealed to the agent, and instead must be inferred using experience on prior tasks and limited experience (τ1:t-1) on the new task. In (b), there is no shared optimal task T′ to relabel all attempts with. HTR relabels each attempt τ under its own hindsight task T′, and modifies the underlying meta-RL training loop to learn adaptation strategies on the relabelled tasks. Note that we include multiple trajectories τ in (b) vs a single trajectory in (a) to highlight the adaptation stage in meta-RL, which does not exist in goal-conditioned RL and requires significantly different sampling and relabeling procedures.   27

3.2 Sparse reward environments for meta-RL that require temporally-extended exploration. In each environment, the task (the top-left circle in (a), the green sphere in (b) and (c)) is not revealed to the agent via the observation. The agent must instead infer the task through temporally-extended exploration (illustrated by the dotted lines in (a)), since no reward signal is provided until the task is successfully completed. Prior meta-RL methods such as PEARL (Rakelly et al. 2019) and MAESN (Gupta et al. 2018b) are only able to (meta-)learn meaningful adaptation strategies using dense reward functions. Our approach, Hindsight Task Relabeling (HTR), can (meta-)train with the original sparse reward function and does not require additional dense reward functions.   30

3.3 Illustration of Hindsight Task Relabeling (HTR) in a meta-RL training loop. HTR is agnostic to the underlying (off-policy) meta-RL algorithm; the agent architecture and/or training specifics (e.g., the encoder φ, actor π and Q-function neural networks shown in blue) can be modified independently of the relabeling scheme. HTR can also be performed in an ‘eager’ fashion at the data collection stage (as opposed to ‘lazy’ relabeling in the data sampling stage), see Section 3 for details.   31

3.4 HTR algorithm   33

3.5 Evaluating adaptation to train tasks progressively during meta-training. Y-axis measures average sparse return during adaptation throughout meta-training (shaded std dev), though the oracle is still trained using dense reward. Conventional meta-RL methods struggle to learn using sparse reward. Hindsight Task Relabeling (HTR) is comparable to dense reward meta-training performance.   36

3.6 Evaluating adaptation to test tasks after meta-training. Y-axis measures average (sparse) return during adaptation using context collected online, using sparse reward only. Adaptation strategies learned with Hindsight Task Relabeling (HTR) generalize to held-out tasks as well as the oracle which is learned using shaped reward functions. Without HTR or access to a shaped reward during meta-training, the agent is unable to learn a reasonable strategy.   37

3.7 Visualizing exploration behavior learned during meta-training using 300 pre-adaptation trajectories (i.e., sampled from the latent task prior). In the sparse reward setting, without HTR (middle row) the agent is unable to learn a meaningful exploration strategy and appears to explore randomly near the origin. With HTR (bottom row), the agent learns to explore near the true task distribution (grey circles), similar to an agent trained with a shaped dense reward function (top row).   38

3.8 Comparing HTR with SER vs EC on Point Robot   38

3.9 Average return when varying K on Point Robot   38

3.10 Average task distance when varying K on Point Robot   38

3.11 Relative reward signal from hindsight vs ground truth tasks using Point Robot.   39

3.12 Meta-training on Point Robot with varying goal distances. If the distance to the goal is short enough for random exploration to lead to sparse reward, meta-training is possible using only the sparse reward function. Once this is no longer the case, meta-training is only possible with a proxy dense reward function, or by using Hindsight Task Relabelling on the original sparse reward function.   41

3.13 Illustration of Hindsight Task Relabeling (HTR) using Episode Clustering (EC) in a meta-RL training loop, where relabelling occurs at the data collection stage.   42

4.1 MemGPT writes data to persistent memory after it receives a system alert about limited context space.   45

4.2 MemGPT can search out-of-context data to bring relevant information into the current context window.   45

4.3 In MemGPT, a fixed-context LLM processor is augmented with a hierarchical memory system and functions that let it manage its own memory. The LLM's prompt tokens (inputs), or main context, consist of the system instructions, working context, and a FIFO queue. The LLM completion tokens (outputs) are interpreted as function calls by the function executor. MemGPT uses functions to move data between main context and external context (the archival and recall storage databases). The LLM can request immediate follow-up LLM inference to chain function calls together by generating a special keyword argument (request_heartbeat=true) in its output; function chaining is what allows MemGPT to perform multi-step retrieval to answer user queries. (A minimal illustrative sketch of this control flow follows this list.)   46

4.4 Comparing context lengths of commonly used models and LLM APIs (data collected 1/2024). *Approximates message count assuming a preprompt of 1k tokens, and an average message size of 50 tokens (250 characters).   48

4.5 An example conversation snippet where MemGPT updates stored information. Here the information is stored in working context memory (located within the prompt tokens).   48

4.6 Document QA task performance. MemGPT's performance is unaffected by increased context length. Methods such as truncation can extend the effective context lengths of fixed-length models such as GPT-4, but such compression methods will lead to performance degradation as the necessary compression grows. Running MemGPT with GPT-4 and GPT-4 Turbo yields equivalent results on this task.   52

4.7 An example of MemGPT solving the document QA task. A database of Wikipedia documents is uploaded to archival storage. MemGPT queries archival storage via function calling, which pulls paginated search results into main context.   52

4.8 Nested KV retrieval task performance. MemGPT is the only approach that is able to consistently complete the nested KV task beyond 2 nesting levels. While GPT-4 Turbo performs better as a baseline, MemGPT with GPT-4 Turbo performs worse than MemGPT with GPT-4.   54

4.9 An example of MemGPT solving the nested KV task (UUIDs shortened for readability). The example key-value pair has two nesting levels, and the MemGPT agent returns the final answer when a query for the final value (f37 617) only returns one result (indicating that it is not also a key).   54

4.10 MemGPT algorithm pseudocode   57
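The control flow described in the caption of Figure 4.3 (a main context assembled from system instructions, working context, and a FIFO queue; completion tokens parsed as function calls; request_heartbeat=true used to chain calls) can be summarized in a short sketch. The snippet below is only a minimal illustration of that loop under assumed names: the scripted stand-in for the LLM, the JSON call format, and functions such as archival_memory_search and send_message are placeholders for this sketch, not the thesis's actual implementation or function set.

```python
# Minimal sketch of a MemGPT-style control loop (illustrative only; assumed names).
import json
from typing import Callable, Dict, List


def assemble_main_context(system_instructions: str,
                          working_context: str,
                          fifo_queue: List[str]) -> str:
    # Main context (prompt tokens) = system instructions + working context + FIFO queue.
    return "\n".join([system_instructions, working_context, *fifo_queue])


def run_agent_step(llm: Callable[[str], str],
                   functions: Dict[str, Callable[..., str]],
                   system_instructions: str,
                   working_context: str,
                   fifo_queue: List[str]) -> None:
    # One user-visible step: keep calling the LLM while it requests a "heartbeat".
    heartbeat = True
    while heartbeat:
        prompt = assemble_main_context(system_instructions, working_context, fifo_queue)
        completion = llm(prompt)                                # completion tokens
        call = json.loads(completion)                           # parsed as a function call
        args = call.get("arguments", {})
        heartbeat = bool(args.pop("request_heartbeat", False))  # chaining signal
        result = functions[call["name"]](**args)                # function executor dispatch
        fifo_queue.append(f"FUNCTION RESULT: {result}")         # result re-enters main context


# Toy usage: a scripted "LLM" that first searches external storage, then messages the user.
scripted_outputs = iter([
    json.dumps({"name": "archival_memory_search",
                "arguments": {"query": "user's hometown", "request_heartbeat": True}}),
    json.dumps({"name": "send_message",
                "arguments": {"message": "You grew up in Monterey, right?"}}),
])
functions = {
    "archival_memory_search": lambda query: "1 hit: 'user grew up in Monterey'",
    "send_message": lambda message: f"sent: {message}",
}
run_agent_step(lambda prompt: next(scripted_outputs), functions,
               system_instructions="System instructions for a MemGPT-style agent.",
               working_context="persona: helpful assistant",
               fifo_queue=[])
```

In this sketch a follow-up inference happens only when the model itself sets request_heartbeat, mirroring the self-directed function chaining described in the caption above; everything else (storage backends, queue eviction, prompt formats) is deliberately omitted.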

List of Tables

2.1 Generalization performance (in % success) of each algorithm, averaged over all environments (mean and standard deviation over five runs)   14

2.2 Ranges of parameters for each version of each environment, using set notation   17

2.3 Mean and standard deviation over five runs of generalization performance (in % success) on Acrobot   18

2.4 Mean and standard deviation over five runs of generalization performance (in % success) on CartPole   19

2.5 Mean and standard deviation over five runs of generalization performance (in % success) on MountainCar   19

2.6 Mean and standard deviation over five runs of generalization performance (in % success) on Pendulum   20

2.7 Mean and standard deviation over five runs of generalization performance (in % success) on HalfCheetah   20

2.8 Mean and standard deviation over five runs of generalization performance (in % success) on Hopper   21

4.1 Deep memory retrieval (DMR) performance. In this task, the agent is asked a specific question about a topic discussed in a prior conversation (sessions 1–5). The agent's response is scored against the gold answer. MemGPT significantly outperforms the fixed-context baselines. ‘R-L’ is ROUGE-L.   49

4.2 Conversation opener performance. The agent's conversation opener is evaluated using similarity scores to the gold persona labels (SIM-1/3) and to the human-created opener (SIM-H). MemGPT is able to exceed the performance of the human-created conversation opener with a variety of underlying models.   49


Acknowledgments

First and foremost, I want to thank my family, who always pushed me to achieve more. They are the reason I love to do hard things.

Next I would like to thank my advisor, Professor Joseph E. Gonzalez. Joey helped me achieve my one true goal in the PhD: to make science fiction into science reality. His flexibility and encouragement, regardless of where my research interests led (even when not directly in his critical research path), were instrumental to my success. I could not have asked for a better PhD advisor.

I am also deeply grateful to my other thesis committee members: Ion Stoica, Matei Zaharia, and Yuandong Tian. Having such renowned world experts in AI and systems research on my committee was an incredible honor.

My journey in AI research began at UC San Diego, where I worked with Professors Julian McAuley and Kamalika Chaudhuri as an undergraduate. This led to my work with Professor Lawrence Holder during an REU at Washington State University, where I wrote my first first-author paper. After graduation, Professor Dawn Song took a chance on me, hiring me after a brief chat at a Starbucks in Hayes Valley, a moment that brought me to Berkeley and set me on my path toward the PhD.

Several mentors were crucial to my development as a researcher during my time at Berkeley. Vladlen Koltun taught me invaluable lessons about research discipline, particularly about knowing when to abandon ‘zombie’ research projects, advice I wish I had followed more closely. Richard Shin and Katelyn Gao worked closely with me during my first two years at Berkeley and were great mentors. Once I began the PhD, Rowan McAllister and Nick Rhinehart guided my research in autonomous vehicles and helped maintain my research momentum during the challenging middle years of my PhD. I'm also grateful to Pieter Abbeel and Sergey Levine, who, though not my formal advisors, provided crucial feedback that helped several papers cross the finish line to publication.

The RISELab was an incredible home for my research. I was fortunate to work alongside amazing colleagues in Joey's group: Kevin Lin, Lisa Dunlap, Justin Wong, Shishir Patil, Tianjun Zhang, Paras Jain, Sukrit Kalra, and Suzie Petryk. The infamous "Star Factory" cubicle, which allegedly housed the Databricks founders and later the Anyscale founders, became the birthplace of MemGPT, Gorilla, and SkyPlane during my time there, an unmatched density of open source research contributions in a single cubicle space.

And finally, I would like to thank Sarah Wooders and Kevin Lin, who are joining me on an exciting new adventure post-PhD, where we'll be taking our research on context management for LLM agents into the real world.

This thesis, and the journey it represents, would not have been possible without the support, guidance, and encouragement of all these incredible people. Thank you.

Additional context around this thesis: This thesis was written during an extraordinary period in artificial intelligence research (2017-2024). When I began my PhD, deep reinforcement learning was at the forefront of autonomous systems research, with breakthroughs like AlphaGo and OpenAI Five demonstrating superhuman performance in complex games.

Then came the transformer revolution. What started as incremental improvements in natural language processing rapidly evolved into something far more profound. The release of ChatGPT in late 2022 marked a paradigm shift not just in AI research, but in how society viewed artificial intelligence. Large Language Models demonstrated capabilities that seemed impossible just a few years earlier: sophisticated reasoning and general intelligence.

I had the unique privilege of not just witnessing this revolution, but actively participating in it. My research journey paralleled this transition: from working on fundamental challenges in deep reinforcement learning, to ultimately helping pioneer new approaches for building reliable autonomous systems using Large Language Models. This thesis reflects both the ‘before’ and ‘after’ of this pivotal moment in AI history, a time that will likely be remembered as the beginning of the foundation model era.

The speed of progress during this period was unprecedented. Papers that seemed cutting-edge when I started my PhD quickly became historical artifacts. Research directions that appeared promising were suddenly obsolete. Yet this rapid evolution created extraordinary opportunities to contribute to genuinely new directions in computer science: to help establish the foundations for how we build AI systems in this new era.

This thesis represents my small contribution to this remarkable period in computing history.

Chapter 1

Introduction

Building intelligent autonomous systems that can effectively reason, adapt, and interact with their environment has been a longstanding goal in artificial intelligence. The recent deep learning revolution, particularly the emergence of Large Language Models (LLMs), has dramatically changed our approach to building such systems. This thesis traces this evolution through several key advances in building agentic systems, from deep reinforcement learning to modern LLM-based approaches, focusing on the critical components needed to create reliable autonomous agents.

1.1 Background

The development of agentic systems has undergone several significant paradigm shifts, each introducing new capabilities and challenges. Understanding these shifts and their implications is crucial for building effective autonomous agents.

1.1.1 The Deep Learning Revolution in Robotics and Control

The integration of deep neural networks with reinforcement learning marked a significant advancement in autonomous systems. This combination enabled:

• End-to-End Learning: Deep RL allowed systems to learn directly from raw sensory input, eliminating the need for hand-engineered features.

• Complex Policy Learning: Neural networks as function approximators enabled learning sophisticated control policies for high-dimensional tasks.

• Improved Generalization: Deep architectures promised better transfer of learned behaviors across similar tasks.

However, several key challenges emerged:


• Limited Generalization: Learned policies often failed to transfer beyond their specific training conditions.

• Sample Inefficiency: Deep RL systems required extensive
