
A Survey on the Optimization of Large Language Model-based Agents

arXiv:2503.12434v1 [cs.AI] 16 Mar 2025

SHANGHENG DU, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

JIABAO ZHAO∗, School of Computer Science and Technology, Donghua University, China

JINXIN SHI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

ZHENTAO XIE, School of Computer Science and Technology, East China Normal University, China

XIN JIANG, School of Computer Science and Technology, East China Normal University, China

YANHONG BAI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

LIANG HE, School of Computer Science and Technology, East China Normal University, China

With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at /YoungDubbyDu/LLM-Agent-Optimization.

1 Introduction

The development of autonomous agents has been a long-term pursuit in Artificial Intelligence (AI). AI agents have evolved from early rule-based and expert system-based architectures to reinforcement learning (RL)-driven agents, which are now widely applied in many fields [35]. Traditional RL-based agents optimize policies through interaction with environments, using structured reward functions to achieve goals and improve performance over time. However, these approaches often require extensive training, rely on well-defined state-action spaces, and struggle with generalization across diverse tasks.

∗Corresponding author.

Authors' Contact Information: Shangheng Du, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, dsh@.cn; Jiabao Zhao, School of Computer Science and Technology, Donghua University, Shanghai, China, jbzhao@; Jinxin Shi, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, jinxinshi@; Zhentao Xie, School of Computer Science and Technology, East China Normal University, Shanghai, China, ecnudavidtao@; Xin Jiang, School of Computer Science and Technology, East China Normal University, Shanghai, China, 51275901099@; Yanhong Bai, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, Lucky_Baiyh@; Liang He, School of Computer Science and Technology, East China Normal University, Shanghai, China, lhe@.

In recent years, Large Language Models (LLMs) such as GPT-4 [120], PaLM 2 [5], and DeepSeek-R1 [52] have achieved remarkable success, demonstrating exceptional capabilities in language understanding, reasoning, planning, and complex decision-making. Building on these strengths, LLMs can serve as agents, providing a promising pathway to improve autonomous decision-making and achieve AGI [169]. Unlike conventional RL-based agents, which optimize explicit reward-driven policies, LLM-based agents operate through text-based instructions, prompt templates, and in-context learning (ICL), allowing greater flexibility and generalization. These agents leverage the comprehension and reasoning capabilities of LLMs to interact with environments through natural language, execute complex multi-step tasks, and dynamically adapt to evolving scenarios. Existing LLM agents utilize various methods such as task decomposition [64], self-reflection [133], memory augmentation [210], and multi-agent collaboration [86] to achieve high performance across a range of domains, including software development [67], mathematical reasoning [1], embodied intelligence [212], web navigation [28], and more.

However, despite their strengths, LLMs are not inherently designed for autonomous decision-making and long-term tasks. Their training objectives focus on next-token prediction rather than the reasoning, planning, or interactive learning required for agent-based tasks, so they lack explicit training on agent-centric tasks. As a result, deploying LLMs as agents in complex environments presents several key challenges: 1) LLM-based agents struggle with long-horizon planning and multi-step reasoning, as their generated content may lead to task inconsistencies or error accumulation over extended interactions. 2) The limited memory capacity of LLMs hinders agents from utilizing past experiences for reflection, leading to suboptimal decision-making and task performance. 3) The adaptability of LLM-based agents to novel environments is constrained, as they primarily rely on pre-trained knowledge or fixed contexts, limiting their ability to handle dynamic scenarios. These limitations are particularly evident in open-source LLMs, which lag behind proprietary models like GPT-4 in agent-specific capabilities. Additionally, the high cost and lack of transparency of closed-source LLMs highlight the need for optimizing open LLMs to enhance agent capabilities.

Existing techniques, such as supervised fine-tuning (SFT) [122] and reinforcement learning with human feedback (RLHF) [121], have made significant strides in improving LLM performance on instruction-following tasks, but they fail to fully address the challenges of decision-making, long-term planning, and adaptability for LLM-based agents. Optimizing LLM-based agents requires a broader understanding of dynamic environments and agent behaviors, which calls for specialized techniques that go beyond traditional LLM fine-tuning and prompt engineering methods. To address these challenges, numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks. These methods ensure that agents can generalize across diverse environments, refine strategies based on feedback, and efficiently utilize external resources such as tools, memory, and retrieval mechanisms.

In this paper, we provide a comprehensive survey on LLM-based agent optimization, systematically categorizing methods into parameter-driven and parameter-free optimization strategies. Our work focuses on the technical methodologies employed to optimize agent capabilities, such as agent tuning, RL, and other approaches to improve agent performance. Specifically, Parameter-driven Optimization refines LLM parameters to enhance agent performance. This category includes conventional fine-tuning approaches, covering key stages such as agent trajectory data construction and fine-tuning strategies. In addition, we explore RL-based optimization, which is divided into two distinct optimization directions: reward function-based methods leveraging traditional RL techniques such as Actor-Critic [147] and Proximal Policy Optimization (PPO) [136], and preference alignment-based methods utilizing Direct Preference Optimization (DPO) [132] to synchronize agent policies with human preferences or task-specific objectives. Finally, we discuss hybrid fine-tuning optimization strategies, an emerging area that combines SFT with RL to iteratively refine agent behavior. In contrast, we also briefly outline Parameter-free Optimization methods that focus on improving agent behavior without modifying model parameters. These methods leverage prompt engineering, in-context learning, and retrieval-augmented generation (RAG), incorporating various types of information into prompts to guide agents' actions. They are categorized into feedback-based optimization, experience-based optimization, tool-based optimization, retrieval-augmented optimization, and multi-agent collaborative optimization.

Fig. 1. An Overview of the Paper Organization.

Comparison to related surveys. Despite the growing research interest in LLM-based agents, existing surveys primarily focus on general LLM optimization or specific agent abilities such as planning, memory, and role-playing, without treating LLM-based agent optimization as a distinct research area. Surveys on LLM optimization mainly cover fine-tuning [115, 122] and self-evolution approaches [150], but lack discussion of the specialized optimization required for agent capabilities. On the other hand, existing agent-related surveys generally categorize works based on architectural components such as planning [64], memory [210], or multi-agent coordination [86], rather than systematically summarizing the techniques dedicated to optimizing LLM-based agent behaviors and performance. By comparison, this work is the first survey dedicated to LLM-based agent optimization techniques, facilitating a clearer understanding and comparison of existing methods and providing directions for future research.

Scope and rationales. (1) We survey only LLM-based agent optimization algorithms that improve agent task performance, such as problem-solving and decision-making, covering parameter-driven and parameter-free approaches; we exclude works centered on general LLM efficiency, role-playing, or dialogue. (2) Our selection includes papers from AI and NLP conferences and journals, as well as recent high-impact preprints from arXiv, to ensure coverage of the latest advancements. (3) We focus on studies published since 2022 to reflect recent advancements in LLM-based agent optimization.

Organization of this survey. The schematic representation of this manuscript's layout can be found in Figure 1. Section 2 provides the background knowledge and related concepts. In Section 3, we systematically review parameter-driven optimization approaches that modify LLM parameters to enhance agent capabilities, categorizing them into three main strategies: fine-tuning-based optimization (§3.1), RL-based optimization (§3.2), and hybrid optimization (§3.3). Section 4 summarizes and classifies existing work on parameter-free optimization strategies. Then, Section 5 presents datasets and benchmarks, while Section 6 reviews practical applications across various domains. Finally, Section 7 highlights challenges and future directions.

2 Background

2.1 Reinforcement Learning-based Agent Optimization

RL has long been a fundamental approach in agent optimization, allowing agents to learn from interactions with environments. Current RL methods mainly optimize agent behaviors using value-based and policy-based approaches [35, 106, 117]. Value-based methods, such as Q-learning [25, 163], optimize an agent's action-value function to maximize long-term rewards. These methods are effective in discrete action spaces but struggle with high-dimensional state or action spaces. Policy-based methods, including Policy Gradient [48, 124], directly optimize the agent's policy by adjusting parameters based on reward gradients. To improve stability and sample efficiency, PPO [136] introduced a constraint on policy updates, mitigating performance degradation during training. Actor-Critic methods [147] combine value estimation with policy learning, improving convergence efficiency and decision robustness. Beyond single-agent settings, Multi-Agent Reinforcement Learning (MARL) extends RL techniques to scenarios involving multiple interacting agents, enabling both cooperative and competitive dynamics [12, 204].
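For reference, the value-based and policy-based methods named above can be summarized by two standard formulations (textbook forms, not reproduced from any surveyed work): the tabular Q-learning update

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big],$$

and the PPO clipped surrogate objective

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $\alpha$ is the learning rate, $\gamma$ the discount factor, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping threshold that constrains each policy update.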

In recent years, RL has also been increasingly applied to aligning AI agents with human intentions, particularly in preference-based optimization. RLHF [121] has emerged as a prominent approach, refining agent policies based on human-provided signals to improve alignment with desired behaviors. DPO [132] optimizes policies directly from preference data without reward modeling, improving alignment and controllability. Overall, RL-based optimization has evolved from early value-based and policy-based learning to more advanced techniques that integrate structured feedback and multi-agent coordination, providing a foundation for improving decision-making in LLM-based agents.
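For reference, the standard DPO objective trains the policy $\pi_\theta$ directly on preference pairs $(x, y_w, y_l)$ against a frozen reference policy $\pi_{\text{ref}}$ with temperature $\beta$ (this is the original formulation; in agent settings $y_w$ and $y_l$ are typically preferred and dispreferred trajectories):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big],$$

where $\sigma$ denotes the sigmoid function.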

2.2 LLM Fine-Tuning

LLM fine-tuning is a critical method for adapting pre-trained models to specific tasks through optimizing parameters, making them more suited to the desired application. The most popular approach is SFT, where LLMs are trained on labeled data to improve task-specific performance. Instruction Tuning is a commonly used method in SFT, where LLMs are further trained on instruction-output pairs to enhance their ability to follow human commands [98, 205]. Another major development is parameter-efficient fine-tuning (PEFT), including methods like P-Tuning [103], LoRA [59], and QLoRA [30]. These techniques adjust a small subset of parameters, significantly reducing the computational cost of fine-tuning while preserving LLM performance, making them highly efficient for real-world applications. Additionally, RLHF has been used to fine-tune LLMs by integrating human feedback, improving their decision-making and output alignment with user preferences [121]. These optimization techniques enable LLMs to adapt more efficiently to a wide range of tasks, enhancing their effectiveness in real-world scenarios.
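To make the PEFT idea concrete, the sketch below wraps a causal LM with LoRA adapters using the Hugging Face transformers and peft libraries; the base model name, target modules, and hyperparameters are illustrative assumptions rather than settings recommended by the surveyed works.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft (illustrative values).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The base model is an assumption for illustration; any causal LM works the same way.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA adds trainable low-rank matrices A, B so that W' = W + BA, keeping W frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor applied to BA
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# `model` can now be passed to a standard SFT trainer on instruction-output pairs.
```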

2.3 LLM-based RAG

RAG combines LLMs with external information retrieval systems to enhance the relevance and accuracy of generated outputs. By retrieving relevant documents from external sources, RAG allows LLMs to address the knowledge constraints inherent in the models themselves. The evolution of RAG methods has been marked by significant advancements in the integration of retrieval and generation [44]. Early Naive RAG methods focus on directly retrieving relevant documents to augment the generative process, improving the quality of responses in tasks requiring factual knowledge. To address the challenges of Naive RAG, Advanced RAG was introduced, refining the retrieval process by incorporating more effective ranking, filtering, and document selection strategies. Subsequently, Modular RAG introduces a modular framework that optimizes the retrieval and generative components independently. This modular approach enables task-specific optimizations, allowing for more flexibility and scalability in applications across different domains [8, 193]. These advancements in RAG highlight its potential to enhance LLMs by enabling dynamic access to external knowledge, making them more adaptable and capable of addressing complex tasks in real-world scenarios.
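As a minimal sketch of the retrieve-then-generate pattern described above (a naive RAG example; the embedding model, toy corpus, and prompt template are assumptions for illustration, not components prescribed by the survey):

```python
# Naive RAG sketch: embed a corpus, retrieve top-k passages, prepend them to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "LoRA adds low-rank adapters to frozen transformer weights.",
    "PPO clips the policy ratio to stabilize reinforcement learning updates.",
    "ReAct interleaves reasoning traces with environment actions.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query by cosine similarity."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

# The assembled prompt would then be passed to any LLM's generate/chat interface.
print(build_prompt("How does PPO stabilize training?"))
```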

3 Parameter-driven Optimization of LLM-based Agents

Comparison with LLM parameter optimization. Parameter-driven LLM optimization focuses on "how to create a better model", aiming to enhance general language understanding, instruction following, and broad task performance. In contrast, LLM-based agent parameter optimization addresses "how to use the model to solve complex agent tasks", emphasizing decision-making, multi-step reasoning, and task execution in dynamic environments. Although general LLM optimization improves fluency and factual accuracy across diverse applications, LLM-agent optimization is task-specific, requiring models to adapt strategies, interact with environments, and refine behaviors for autonomous problem-solving. Parameter-driven optimization of LLM-based agents primarily relies on expert trajectory data or self-generated trajectory data obtained through environment exploration, and then employs various optimization techniques to iteratively refine policies and enhance performance.

In this section, we discuss how parameter-driven optimization methods improve the performance of LLM-based agents. Specifically, we categorize these methods into three main technical approaches according to different strategies for parameter tuning: conventional fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid optimization.

3.1 Conventional Fine-Tuning-based Optimization

Conventional fine-tuning-based agent optimization involves tuning pre-trained LLMs' parameters through various fine-tuning techniques, such as instruction tuning and parameter-efficient fine-tuning. Trajectories for fine-tuning are typically constructed in the form of SFT data and are used to adjust the agent's parameters to better align with task-specific requirements. The optimization process typically consists of two major steps: 1) constructing high-quality trajectory data tailored to agent tasks; 2) fine-tuning LLM-based agents using these trajectory data; the complete process is presented in Figure 2. Previous studies [40, 83, 122] have shown that the quality of training data significantly impacts model performance, highlighting the importance of generating, filtering, and effectively utilizing high-quality trajectories. This makes trajectory construction a critical step in the fine-tuning pipeline, directly influencing the LLM-based agent's overall performance. In Table 1, we provide a comprehensive overview of fine-tuning-based agent optimization methods, highlighting the data processing techniques and fine-tuning strategies used in each work. It is important to note that this section excludes fine-tuning methods that involve reinforcement learning or preference alignment techniques (e.g., DPO, PPO), which will be addressed in §3.2. Instead, in this section, we only focus on the traditional LLM fine-tuning techniques applied in existing works, aiming to ensure that each stage of the conventional fine-tuning-based agent optimization workflow is clearly introduced.

Fig. 2. Workflow of Fine-Tuning-based Optimization for LLM-based Agents.

3.1.1 Trajectory Data Construction for Agent Fine-Tuning. The construction of high-quality trajectories is a crucial step before the fine-tuning of LLM-based agents, which aims to empower LLMs with agent abilities. This process involves the generation of trajectory data, followed by evaluation and filtering, and the potential utilization of low-quality samples, to construct refined data that meet the requirements for effective fine-tuning.
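To make the notion of a trajectory concrete, the sketch below shows one plausible record format and its conversion into chat-style SFT samples; the field names, message layout, and reward-based filter are illustrative assumptions rather than a scheme prescribed by any surveyed method.

```python
# Illustrative trajectory record: a task, interleaved thought/action/observation steps,
# and a scalar reward used later for filtering (field names are assumptions).
trajectory = {
    "task": "Find the cheapest laptop under $800 and add it to the cart.",
    "steps": [
        {"thought": "I should search the catalog first.",
         "action": "search[laptop under $800]",
         "observation": "Found 3 results: A ($749), B ($799), C ($899)."},
        {"thought": "Item A is the cheapest valid option.",
         "action": "add_to_cart[A]",
         "observation": "Item A added to cart."},
    ],
    "reward": 1.0,  # e.g., a task-success signal from the environment
}

def to_sft_messages(traj: dict, min_reward: float = 1.0) -> list[dict]:
    """Flatten a high-reward trajectory into chat messages for supervised fine-tuning."""
    if traj["reward"] < min_reward:          # simple quality filter
        return []
    messages = [{"role": "user", "content": traj["task"]}]
    for step in traj["steps"]:
        messages.append({"role": "assistant",
                         "content": f"Thought: {step['thought']}\nAction: {step['action']}"})
        messages.append({"role": "user", "content": f"Observation: {step['observation']}"})
    return messages

print(to_sft_messages(trajectory))
```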

Data Acquisition and Generation. High-quality trajectory data construction begins with the acquisition and generation of initial data, which requires not only a diverse set of trajectories but also sufficient alignment with the target tasks to ensure effective learning. Methods for acquiring and generating such data can generally be classified into four broad categories: expert-annotated data, strong LLM-generated trajectories, self-exploration environment-interaction trajectories, and multi-agent collaboration-based construction. Here, we introduce the utilization and construction processes of each category and review the relevant studies.

(1) Expert-annotated data. Expert-annotated trajectories refer to high-quality datasets manually crafted by human experts, often considered the gold standard for fine-tuning. These data ensure task reliability and alignment, as experts can meticulously design and annotate trajectories tailored to specific cases.

Many works [14, 39, 144, 158, 177] utilize ReAct-style expert trajectories as initial datasets, with data including thoughts, observations, and actions [189], which enable agents to mimic expert decision-making processes more effectively. For instance, IPR [177] leverages such trajectories to help agents acquire foundational skills. Similarly, ETO [144] and AGILE [39] apply Chain of Thought (CoT) methods [164] to expert trajectories for imitation learning, reinforcing task-specific behaviors. To ensure alignment with pre-trained LLM domains, Agent-FLAN [22] transforms ReAct-style expert trajectories into multi-turn dialogues, segmenting the dialogue into different task-specific turns, such as instruction following and reasoning. StepAgent [29] introduces a two-phase learning process, where agents first observe discrepancies between their policies and expert trajectories, then iteratively refine their actions. Additionally, AgentOhana [202] standardizes heterogeneous agent expert trajectories into a unified format to improve data consistency. Despite their reliability and alignment with specific tasks, these datasets are resource-intensive and lack scalability, so they are commonly supplemented with other data acquisition methods to enhance dataset diversity.

Table 1. Comparison of Conventional Fine-Tuning-based Optimization for LLM-based Agents: Data Construction and Fine-Tuning. Note: MA - Multi-Agent Framework; LQ - Low-Quality Data Utilization.

Method | Generation | Filtering | MA | LQ | Fine-tune Approach | Base Model
AgentTuning [199] | Strong LLM | Human or Rule | | | Instruction Tuning | Llama-2-7B/13B/70B
SMART [197] | Multi-agent | Environment | | | LoRA | Llama-2-7B
Agent-FLAN [22] | Expert | Model | | | Instruction Tuning | Llama-2-7B
Self-Talk [153] | Multi-agent | Human or Rule | | | LoRA | MosaicAI-7B-Chat
ENVISIONS [178] | Self-exploration | Environment | | | SFT | Llama-2-7B/13B-Chat
AgentGym [170] | Strong LLM & Expert | Environment | | | BC | Llama-2-7B-Chat
FireAct [14] | Strong LLM | Environment | / | / | LoRA | GPT-3.5, Llama-2-7B/13B, CodeLlama-7B/13B/34B-Instruct
NAT [158] | Strong LLM | Environment | | | SFT | Llama-2-7B/13B-Chat
AgentLumos [192] | Strong LLM | Human or Rule | / | / | LoRA | Llama-2-7B/13B
STE [154] | Self-exploration | Model | | | SFT | Llama-2-7B/13B-Chat, Mistral-7B-Instruct
OPTIMA [19] | Multi-agent | Human or Rule | | | SFT | Llama-3-8B
Zhou et al. [216] | Strong LLM | Human or Rule | | | LoRA | OpenChat v3.2, Llama-2-7B, AgentLM-7B
AgentOhana [202] | Expert | Model | / | / | QLoRA | xLAM-v0.1
COEVOL [85] | Expert | Model | | | SFT | Llama-2-7B, Mistral-7B
AGENTBANK [143] | Strong LLM | Environment | | | Instruction Tuning | Llama-2-Chat
ADASWITCH [146] | Self-exploration | Model | | | SFT | DeepSeek-Coder-1.3B, StarCoder2-3B
IPR [177] | Expert & Self-exploration | Environment | | | Instruction Tuning | Llama-2-7B
Re-ReST [33] | Self-exploration | Environment | | | LoRA | Llama-2-7B/13B, Llama-3-8B, CodeLlama-13B, VPGen
ATM [219] | Multi-agent | / | | | MITO | Llama-2-7B
Aksitov et al. [3] | Self-exploration | Model-based | / | / | SFT | PaLM-2-base-series
SWIFTSAGE [94] | Self-exploration | Environment | | | SFT | T5-Large
AGILE [39] | Expert | / | / | / | BC | Vicuna-13B, Meerkat-7B
NLRL [40] | Self-exploration | / | / | / | SFT | Llama-3.1-8B-Instruct
ETO [144] | Expert | / | | | BC | Llama-2-7B-Chat
Retrospex [171] | Expert | / | | | BC | Flan-T5-Large, Llama-3-8B-Instruct
ToRA [49] | Strong LLM | Human or Rule | | | BC | Llama-2-series, CodeLlama-series
Sayself [179] | Strong LLM | Human or Rule | / | / | SFT | Mistral-7B, Llama-3-8B

(2) Strong LLM-generated trajectories. Strong LLM-generated trajectories leverage powerful LLMs like ChatGPT and GPT-4 to autonomously generate task-specific data. These trajectories are usually produced by reasoning frameworks such as ReAct and CoT, allowing the model to interact with the environment and simulate processes of reasoning, decision-making, and acting.

AgentTuning [199] and FireAct [14] employ ReAct and CoT to guide agent behavior while incorporating Reflexion [139] refinements, improving the diversity of generated data. Some works integrate tools and structured annotations to enhance trajectory informativeness. NAT [158] generates multiple trajectories under different temperature settings, using ReAct prompts and integrating tools such as calculators and APIs during interactions. AgentLumos [192] utilizes GPT-4 and GPT-4V to annotate datasets within planning and grounding modules, producing LUMOS-I and LUMOS-O style data. Other methods explore multi-role simulation to enrich trajectory complexity. Zhou et al. [216] employ GPT-4 to simulate problem generators, action planners, and environment agents, enabling iterative interaction-driven data generation. AGENTBANK [143] also leverages GPT-4 for environment-interaction data and GPT-3.5 for CoT rationales, and finally transforms the data into chatbot-style formats for improved usability.
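A simplified sketch of such a generation pipeline is shown below, in the spirit of temperature-diversified sampling from a strong LLM; the client setup, model name, prompt template, and answer-matching filter are assumptions for illustration, not the exact procedure of NAT or the other cited works.

```python
# Sample candidate ReAct-style trajectories from a strong LLM at several temperatures,
# then keep only those whose final answer matches a reference (illustrative filter).
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

REACT_PROMPT = (
    "Solve the task step by step. Use the format:\n"
    "Thought: ...\nAction: ...\nObservation: ...\n"
    "Finish with 'Final Answer: ...'.\n\nTask: {task}"
)

def sample_trajectories(task: str, temperatures=(0.2, 0.7, 1.0), n_per_temp: int = 2):
    """Collect raw trajectory texts under different temperature settings."""
    outputs = []
    for temp in temperatures:
        for _ in range(n_per_temp):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": REACT_PROMPT.format(task=task)}],
                temperature=temp,
            )
            outputs.append(resp.choices[0].message.content)
    return outputs

def filter_by_answer(trajectories: list[str], reference: str) -> list[str]:
    """Naive environment-free check: keep trajectories ending in the reference answer."""
    return [t for t in trajectories if reference in t.split("Final Answer:")[-1]]
```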

(3) Self-exploration environment-interaction trajectories. Given the high costs of expert annotation and p
