
A Survey on the Optimization of Large Language Model-based Agents

arXiv:2503.12434v1 [cs.AI] 16 Mar 2025

SHANGHENG DU, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

JIABAO ZHAO∗, School of Computer Science and Technology, Donghua University, China

JINXIN SHI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

ZHENTAO XIE, School of Computer Science and Technology, East China Normal University, China

XIN JIANG, School of Computer Science and Technology, East China Normal University, China

YANHONG BAI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China

LIANG HE, School of Computer Science and Technology, East China Normal University, China

With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at /YoungDubbyDu/LLM-Agent-Optimization.

1 Introduction

The development of autonomous agents has been a long-term pursuit in Artificial Intelligence (AI). AI agents have evolved from early rule-based and expert system-based architectures to reinforcement learning (RL)-driven agents, which are now widely applied in many fields [35]. Traditional RL-based agents optimize policies through interaction with environments, using structured reward functions to achieve goals and improve performance over time. However, these approaches often require extensive training, rely on well-defined state-action spaces, and struggle with generalization across diverse tasks.

∗Corresponding author.

Authors' Contact Information: Shangheng Du, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, dsh@.cn; Jiabao Zhao, School of Computer Science and Technology, Donghua University, Shanghai, China, jbzhao@; Jinxin Shi, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, jinxinshi@; Zhentao Xie, School of Computer Science and Technology, East China Normal University, Shanghai, China, ecnudavidtao@; Xin Jiang, School of Computer Science and Technology, East China Normal University, Shanghai, China, 51275901099@; Yanhong Bai, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, Lucky_Baiyh@; Liang He, School of Computer Science and Technology, East China Normal University, Shanghai, China, lhe@.

In recent years, Large Language Models (LLMs) such as GPT-4 [120], PaLM 2 [5], and DeepSeek-R1 [52] have achieved remarkable success, demonstrating exceptional capabilities in language understanding, reasoning, planning, and complex decision-making. Building on these strengths, LLMs can serve as agents, providing a promising pathway to improve autonomous decision-making and achieve AGI [169]. Unlike conventional RL-based agents, which optimize explicit reward-driven policies, LLM-based agents operate through text-based instructions, prompt templates, and in-context learning (ICL), allowing greater flexibility and generalization. These agents leverage the comprehension and reasoning capabilities of LLMs to interact with environments through natural language, execute complex multi-step tasks, and dynamically adapt to evolving scenarios. Existing LLM agents utilize various methods such as task decomposition [64], self-reflection [133], memory augmentation [210], and multi-agent collaboration [86] to achieve high performance across a range of domains, including software development [67], mathematical reasoning [1], embodied intelligence [212], web navigation [28], and more.

However, despite their strengths, LLMs are not inherently designed for autonomous decision-making and long-term tasks. Their training objectives focus on next-token prediction rather than the reasoning, planning, or interactive learning required for agent-based tasks, so they lack explicit training on agent-centric tasks. As a result, deploying LLMs as agents in complex environments presents several key challenges: 1) LLM-based agents struggle with long-horizon planning and multi-step reasoning, as their generated content may lead to task inconsistencies or error accumulation over extended interactions. 2) The limited memory capacity of LLMs hinders agents from utilizing past experiences for reflection, leading to suboptimal decision-making and task performance. 3) The adaptability of LLM-based agents to novel environments is constrained, as they primarily rely on pre-trained knowledge or fixed contexts, limiting their ability to handle dynamic scenarios. These limitations are particularly evident in open-source LLMs, which lag behind proprietary models like GPT-4 in agent-specific capabilities. Additionally, the high cost and lack of transparency of closed-source LLMs highlight the need for optimizing open LLMs to enhance agent capabilities.

Existing techniques, such as supervised fine-tuning (SFT) [122] and reinforcement learning with human feedback (RLHF) [121], have made significant strides in improving LLM performance on instruction-following tasks, but they fail to fully address the challenges of decision-making, long-term planning, and adaptability for LLM-based agents. Optimizing LLM-based agents requires a broader understanding of dynamic environments and agent behaviors, which calls for specialized techniques that go beyond traditional LLM fine-tuning and prompt engineering methods. To address these challenges, numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks. These methods ensure that agents can generalize across diverse environments, refine strategies based on feedback, and efficiently utilize external resources such as tools, memory, and retrieval mechanisms.

In this paper, we provide a comprehensive survey on LLM-based agent optimization, systematically categorizing methods into parameter-driven and parameter-free optimization strategies. Our work focuses on the technical methodologies employed to optimize agent capabilities, such as agent tuning, RL, and other approaches to improve agent performance. Specifically, Parameter-driven Optimization refines LLM parameters to enhance agent performance. This category includes conventional fine-tuning approaches, covering key stages such as agent trajectory data construction and fine-tuning strategies. In addition, we explore RL-based optimization, which is divided into two distinct optimization directions: reward function-based methods leveraging traditional RL techniques such as Actor-Critic [147] and Proximal Policy Optimization (PPO) [136], and preference alignment-based methods utilizing Direct Preference Optimization (DPO) [132] to synchronize agent policies with human preferences or task-specific objectives. Finally, we discuss hybrid fine-tuning optimization strategies, an emerging area that combines SFT with RL to iteratively refine agent behavior. In contrast, we also briefly outline Parameter-free Optimization methods that focus on improving agent behavior without modifying model parameters. These methods leverage prompt engineering, in-context learning, and retrieval-augmented generation (RAG), incorporating various types of information into prompts to guide agents' actions. They are categorized into feedback-based optimization, experience-based optimization, tool-based optimization, retrieval-augmented optimization, and multi-agent collaborative optimization.

Fig. 1. An Overview of the Paper Organization.

Comparison to related surveys. Despite the growing research interest in LLM-based agents, existing surveys primarily focus on general LLM optimization or specific agent abilities such as planning, memory, and role-playing, without treating LLM-based agent optimization as a distinct research area. Surveys on LLM optimization mainly cover fine-tuning [115, 122] and self-evolution approaches [150], but lack discussion of the specialized optimization required for agent capabilities. On the other hand, existing agent-related surveys generally categorize works based on architectural components such as planning [64], memory [210], or multi-agent coordination [86], rather than systematically summarizing the techniques dedicated to optimizing LLM-based agent behaviors and performance. By comparison, this work is the first survey dedicated to LLM-based agent optimization techniques, facilitating a clearer understanding and comparison of existing methods and providing directions for future research.

Scope and rationales. (1) We survey only LLM-based agent optimization algorithms that improve agent task performance, such as problem-solving and decision-making, covering parameter-driven and parameter-free approaches; we exclude works centered on general LLM efficiency, role-playing, or dialogue. (2) Our selection includes papers from AI and NLP conferences and journals, as well as recent high-impact preprints from arXiv, to ensure coverage of the latest advancements. (3) We focus on studies published since 2022 to reflect recent advancements in LLM-based agent optimization.

Organization of this survey. The schematic representation of this manuscript's layout can be found in Figure 1. Section 2 provides the background knowledge and related concepts. In Section 3, we systematically review parameter-driven optimization approaches that modify LLM parameters to enhance agent capabilities, categorizing them into three main strategies: fine-tuning-based optimization (§3.1), RL-based optimization (§3.2), and hybrid optimization (§3.3). Section 4 summarizes and classifies existing work on parameter-free optimization strategies. Then, Section 5 presents datasets and benchmarks, while Section 6 reviews practical applications across various domains. Finally, Section 7 highlights challenges and future directions.

2 Background

2.1 Reinforcement Learning-based Agent Optimization

RL has long been a fundamental approach in agent optimization, allowing agents to learn from interactions with environments. Current RL methods mainly optimize agent behaviors using value-based and policy-based approaches [35, 106, 117]. Value-based methods, such as Q-learning [25, 163], optimize an agent's action-value function to maximize long-term rewards. These methods are effective in discrete action spaces but struggle with high-dimensional state or action spaces. Policy-based methods, including Policy Gradient [48, 124], directly optimize the agent's policy by adjusting parameters based on reward gradients. To improve stability and sample efficiency, PPO [136] introduced a constraint on policy updates, mitigating performance degradation during training. Actor-Critic methods [147] combine value estimation with policy learning, improving convergence efficiency and decision robustness. Beyond single-agent settings, Multi-Agent Reinforcement Learning (MARL) extends RL techniques to scenarios involving multiple interacting agents, enabling both cooperative and competitive dynamics [12, 204].
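For reference, the value-based and policy-based methods named above can be summarized by two standard formulations (textbook forms, not reproduced from any surveyed work): the tabular Q-learning update

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big],$$

and the PPO clipped surrogate objective

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $\alpha$ is the learning rate, $\gamma$ the discount factor, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping threshold that constrains each policy update.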

In recent years, RL has also been increasingly applied to aligning AI agents with human intentions, particularly in preference-based optimization. RLHF [121] has emerged as a prominent approach, refining agent policies based on human-provided signals to improve alignment with desired behaviors. DPO [132] optimizes policies directly from preference data without reward modeling, improving alignment and controllability. Overall, RL-based optimization has evolved from early value-based and policy-based learning to more advanced techniques that integrate structured feedback and multi-agent coordination, providing a foundation for improving decision-making in LLM-based agents.
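For reference, the standard DPO objective trains the policy $\pi_\theta$ directly on preference pairs $(x, y_w, y_l)$ against a frozen reference policy $\pi_{\text{ref}}$ with temperature $\beta$ (this is the original formulation; in agent settings $y_w$ and $y_l$ are typically preferred and dispreferred trajectories):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big],$$

where $\sigma$ denotes the sigmoid function.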

2.2 LLM Fine-Tuning

LLM fine-tuning is a critical method for adapting pre-trained models to specific tasks through optimizing parameters, making them more suited to the desired application. The most popular approach is SFT, where LLMs are trained on labeled data to improve task-specific performance. Instruction Tuning is a commonly used method in SFT, where LLMs are further trained on instruction-output pairs to enhance their ability to follow human commands [98, 205]. Another major development is parameter-efficient fine-tuning (PEFT), including methods like P-Tuning [103], LoRA [59], and QLoRA [30]. These techniques adjust a small subset of parameters, significantly reducing the computational cost of fine-tuning while preserving LLM performance, making them highly efficient for real-world applications. Additionally, RLHF has been used to fine-tune LLMs by integrating human feedback, improving their decision-making and output alignment with user preferences [121]. These optimization techniques enable LLMs to adapt more efficiently to a wide range of tasks, enhancing their effectiveness in real-world scenarios.
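To make the PEFT idea concrete, the sketch below wraps a causal LM with LoRA adapters using the Hugging Face transformers and peft libraries; the base model name, target modules, and hyperparameters are illustrative assumptions rather than settings recommended by the surveyed works.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft (illustrative values).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The base model is an assumption for illustration; any causal LM works the same way.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA adds trainable low-rank matrices A, B so that W' = W + BA, keeping W frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor applied to BA
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# `model` can now be passed to a standard SFT trainer on instruction-output pairs.
```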

2.3 LLM-based RAG

RAG combines LLMs with external information retrieval systems to enhance the relevance and accuracy of generated outputs. By retrieving relevant documents from external sources, RAG allows LLMs to address the knowledge constraints inherent in the models themselves. The evolution of RAG methods has been marked by significant advancements in the integration of retrieval and generation [44]. Early Naive RAG methods focus on directly retrieving relevant documents to augment the generative process, improving the quality of responses in tasks requiring factual knowledge. To address the challenges of Naive RAG, Advanced RAG was introduced, refining the retrieval process by incorporating more effective ranking, filtering, and document selection strategies. Subsequently, Modular RAG introduces a modular framework that optimizes the retrieval and generative components independently. This modular approach enables task-specific optimizations, allowing for more flexibility and scalability in applications across different domains [8, 193]. These advancements in RAG highlight its potential to enhance LLMs by enabling dynamic access to external knowledge, making them more adaptable and capable of addressing complex tasks in real-world scenarios.
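As a minimal sketch of the retrieve-then-generate pattern described above (a naive RAG example; the embedding model, toy corpus, and prompt template are assumptions for illustration, not components prescribed by the survey):

```python
# Naive RAG sketch: embed a corpus, retrieve top-k passages, prepend them to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "LoRA adds low-rank adapters to frozen transformer weights.",
    "PPO clips the policy ratio to stabilize reinforcement learning updates.",
    "ReAct interleaves reasoning traces with environment actions.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query by cosine similarity."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

# The assembled prompt would then be passed to any LLM's generate/chat interface.
print(build_prompt("How does PPO stabilize training?"))
```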

3 Parameter-driven Optimization of LLM-based Agents

Comparison with LLM parameter optimization. Parameter-driven LLM optimization focuses on "how to create a better model", aiming to enhance general language understanding, instruction following, and broad task performance. In contrast, LLM-based agent parameter optimization addresses "how to use the model to solve complex agent tasks", emphasizing decision-making, multi-step reasoning, and task execution in dynamic environments. Although general LLM optimization improves fluency and factual accuracy across diverse applications, LLM-agent optimization is task-specific, requiring models to adapt strategies, interact with environments, and refine behaviors for autonomous problem-solving. Parameter-driven optimization of LLM-based agents primarily relies on expert trajectory data or self-generated trajectory data obtained through environment exploration, and then employs various optimization techniques to iteratively refine policies and enhance performance.

In this section, we discuss how parameter-driven optimization methods improve the performance of LLM-based agents. Specifically, we categorize these methods into three main technical approaches according to different strategies for parameter tuning: conventional fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid optimization.

3.1 Conventional Fine-Tuning-based Optimization

Conventional fine-tuning-based agent optimization involves tuning pre-trained LLMs' parameters through various fine-tuning techniques, such as instruction tuning and parameter-efficient fine-tuning. Trajectories for fine-tuning are typically constructed in the form of SFT data and are used to adjust the agent's parameters to better align with task-specific requirements. The optimization process typically consists of two major steps: 1) constructing high-quality trajectory data tailored to agent tasks; 2) fine-tuning LLM-based agents using these trajectory data; the complete process is presented in Figure 2. Previous studies [40, 83, 122] have shown that the quality of training data significantly impacts model performance, highlighting the importance of generating, filtering, and effectively utilizing high-quality trajectories. This makes trajectory construction a critical step in the fine-tuning pipeline, directly influencing the LLM-based agent's overall performance. In Table 1, we provide a comprehensive overview of fine-tuning-based agent optimization methods, highlighting the data processing techniques and fine-tuning strategies used in each work. It is important to note that this section excludes fine-tuning methods that involve reinforcement learning or preference alignment techniques (e.g., DPO, PPO), which will be addressed in §3.2. Instead, in this section, we only focus on the traditional LLM fine-tuning techniques applied in existing works, aiming to ensure that each stage of the conventional fine-tuning-based agent optimization workflow is clearly introduced.

Fig. 2. Workflow of Fine-Tuning-based Optimization for LLM-based Agents.

3.1.1 Trajectory Data Construction for Agent Fine-Tuning. The construction of high-quality trajectories is a crucial step before the fine-tuning of LLM-based agents, which aims to empower LLMs with agent abilities. This process involves the generation of trajectory data, followed by evaluation and filtering, and the potential utilization of low-quality samples, to construct refined data that meet the requirements for effective fine-tuning.
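To make the notion of a trajectory concrete, the sketch below shows one plausible record format and its conversion into chat-style SFT samples; the field names, message layout, and reward-based filter are illustrative assumptions rather than a scheme prescribed by any surveyed method.

```python
# Illustrative trajectory record: a task, interleaved thought/action/observation steps,
# and a scalar reward used later for filtering (field names are assumptions).
trajectory = {
    "task": "Find the cheapest laptop under $800 and add it to the cart.",
    "steps": [
        {"thought": "I should search the catalog first.",
         "action": "search[laptop under $800]",
         "observation": "Found 3 results: A ($749), B ($799), C ($899)."},
        {"thought": "Item A is the cheapest valid option.",
         "action": "add_to_cart[A]",
         "observation": "Item A added to cart."},
    ],
    "reward": 1.0,  # e.g., a task-success signal from the environment
}

def to_sft_messages(traj: dict, min_reward: float = 1.0) -> list[dict]:
    """Flatten a high-reward trajectory into chat messages for supervised fine-tuning."""
    if traj["reward"] < min_reward:          # simple quality filter
        return []
    messages = [{"role": "user", "content": traj["task"]}]
    for step in traj["steps"]:
        messages.append({"role": "assistant",
                         "content": f"Thought: {step['thought']}\nAction: {step['action']}"})
        messages.append({"role": "user", "content": f"Observation: {step['observation']}"})
    return messages

print(to_sft_messages(trajectory))
```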

Data Acquisition and Generation. High-quality trajectory data construction begins with the acquisition and generation of initial data, which requires not only a diverse set of trajectories but also sufficient alignment with the target tasks to ensure effective learning. Methods for acquiring and generating such data can generally be classified into four broad categories: expert-annotated data, strong LLM-generated trajectories, self-exploration environment-interaction trajectories, and multi-agent collaboration-based construction. Here, we introduce the utilization and construction processes of each category and review the relevant studies.

(1) Expert-annotated data. Expert-annotated trajectories refer to high-quality datasets manually crafted by human experts, often considered the gold standard for fine-tuning. These data ensure task reliability and alignment, as experts can meticulously design and annotate trajectories tailored to specific cases.

Many works [14, 39, 144, 158, 177] utilize ReAct-style expert trajectories as initial datasets, with data including thoughts, observations, and actions [189], which enable agents to mimic expert decision-making processes more effectively. For instance, IPR [177] leverages such trajectories to help agents acquire foundational skills. Similarly, ETO [144] and AGILE [39] apply Chain of Thought (CoT) methods [164] to expert trajectories for imitation learning, reinforcing task-specific behaviors. To ensure alignment with pre-trained LLM domains, Agent-FLAN [22] transforms ReAct-style expert trajectories into multi-turn dialogues, segmenting the dialogue into different task-specific turns, such as instruction following and reasoning. StepAgent [29] introduces a two-phase learning process, where agents first observe discrepancies between their policies and expert trajectories, then iteratively refine their actions. Additionally, AgentOhana [202] standardizes heterogeneous agent expert trajectories into a unified format to improve data consistency. Despite their reliability and alignment with specific tasks, these datasets are resource-intensive and lack scalability, so they are commonly supplemented with other data acquisition methods to enhance dataset diversity.

Table 1. Comparison of Conventional Fine-Tuning-based Optimization for LLM-based Agents: Data Construction and Fine-Tuning. Note: MA - Multi-Agent Framework; LQ - Low-Quality Data Utilization.

Method | Generation | Filtering | MA | LQ | Fine-tune Approach | Base Model
AgentTuning [199] | Strong LLM | Human or Rule | | | Instruction Tuning | Llama-2-7B/13B/70B
SMART [197] | Multi-agent | Environment | | | LoRA | Llama-2-7B
Agent-FLAN [22] | Expert | Model | | | Instruction Tuning | Llama-2-7B
Self-Talk [153] | Multi-agent | Human or Rule | | | LoRA | MosaicAI-7B-Chat
ENVISIONS [178] | Self-exploration | Environment | | | SFT | Llama-2-7B/13B-Chat
AgentGym [170] | Strong LLM & Expert | Environment | | | BC | Llama-2-7B-Chat
FireAct [14] | Strong LLM | Environment | / | / | LoRA | GPT-3.5, Llama-2-7B/13B, CodeLlama-7B/13B/34B-Instruct
NAT [158] | Strong LLM | Environment | | | SFT | Llama-2-7B/13B-Chat
AgentLumos [192] | Strong LLM | Human or Rule | / | / | LoRA | Llama-2-7B/13B
STE [154] | Self-exploration | Model | | | SFT | Llama-2-7B/13B-Chat, Mistral-7B-Instruct
OPTIMA [19] | Multi-agent | Human or Rule | | | SFT | Llama-3-8B
Zhou et al. [216] | Strong LLM | Human or Rule | | | LoRA | OpenChat v3.2, Llama-2-7B, AgentLM-7B
AgentOhana [202] | Expert | Model | / | / | QLoRA | xLAM-v0.1
COEVOL [85] | Expert | Model | | | SFT | Llama-2-7B, Mistral-7B
AGENTBANK [143] | Strong LLM | Environment | | | Instruction Tuning | Llama-2-Chat
ADASWITCH [146] | Self-exploration | Model | | | SFT | DeepSeek-Coder-1.3B, StarCoder2-3B
IPR [177] | Expert & Self-exploration | Environment | | | Instruction Tuning | Llama-2-7B
Re-ReST [33] | Self-exploration | Environment | | | LoRA | Llama-2-7B/13B, Llama-3-8B, CodeLlama-13B, VPGen
ATM [219] | Multi-agent | / | | | MITO | Llama-2-7B
Aksitov et al. [3] | Self-exploration | Model-based | / | / | SFT | PaLM-2-base-series
SWIFTSAGE [94] | Self-exploration | Environment | | | SFT | T5-Large
AGILE [39] | Expert | / | / | / | BC | Vicuna-13B, Meerkat-7B
NLRL [40] | Self-exploration | / | / | / | SFT | Llama-3.1-8B-Instruct
ETO [144] | Expert | / | | | BC | Llama-2-7B-Chat
Retrospex [171] | Expert | / | | | BC | Flan-T5-Large, Llama-3-8B-Instruct
ToRA [49] | Strong LLM | Human or Rule | | | BC | Llama-2-series, CodeLlama-series
Sayself [179] | Strong LLM | Human or Rule | / | / | SFT | Mistral-7B, Llama-3-8B

(2) Strong LLM-generated trajectories. Strong LLM-generated trajectories leverage powerful LLMs like ChatGPT and GPT-4 to autonomously generate task-specific data. These trajectories are usually produced by reasoning frameworks such as ReAct and CoT, allowing the model to interact with the environment and simulate processes of reasoning, decision-making, and acting.

AgentTuning [199] and FireAct [14] employ ReAct and CoT to guide agent behavior while incorporating Reflexion [139] refinements, improving the diversity of generated data. Some works integrate tools and structured annotations to enhance trajectory informativeness. NAT [158] generates multiple trajectories under different temperature settings, using ReAct prompts and integrating tools such as calculators and APIs during interactions. AgentLumos [192] utilizes GPT-4 and GPT-4V to annotate datasets within planning and grounding modules, producing LUMOS-I and LUMOS-O style data. Other methods explore multi-role simulation to enrich trajectory complexity. Zhou et al. [216] employ GPT-4 to simulate problem generators, action planners, and environment agents, enabling iterative interaction-driven data generation. AGENTBANK [143] also leverages GPT-4 for environment-interaction data and GPT-3.5 for CoT rationales, and finally transforms the data into chatbot-style formats for improved usability.
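A simplified sketch of such a generation pipeline is shown below, in the spirit of temperature-diversified sampling from a strong LLM; the client setup, model name, prompt template, and answer-matching filter are assumptions for illustration, not the exact procedure of NAT or the other cited works.

```python
# Sample candidate ReAct-style trajectories from a strong LLM at several temperatures,
# then keep only those whose final answer matches a reference (illustrative filter).
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

REACT_PROMPT = (
    "Solve the task step by step. Use the format:\n"
    "Thought: ...\nAction: ...\nObservation: ...\n"
    "Finish with 'Final Answer: ...'.\n\nTask: {task}"
)

def sample_trajectories(task: str, temperatures=(0.2, 0.7, 1.0), n_per_temp: int = 2):
    """Collect raw trajectory texts under different temperature settings."""
    outputs = []
    for temp in temperatures:
        for _ in range(n_per_temp):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": REACT_PROMPT.format(task=task)}],
                temperature=temp,
            )
            outputs.append(resp.choices[0].message.content)
    return outputs

def filter_by_answer(trajectories: list[str], reference: str) -> list[str]:
    """Naive environment-free check: keep trajectories ending in the reference answer."""
    return [t for t in trajectories if reference in t.split("Final Answer:")[-1]]
```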

(3) Self-exploration environment-interaction trajectories. Given the high costs of expert annotation and p
