




A Survey on the Optimization of Large Language Model-based Agents
arXiv:2503.12434v1 [cs.AI] 16 Mar 2025
SHANGHENG DU, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
JIABAO ZHAO∗, School of Computer Science and Technology, Donghua University, China
JINXIN SHI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
ZHENTAO XIE, School of Computer Science and Technology, East China Normal University, China
XIN JIANG, School of Computer Science and Technology, East China Normal University, China
YANHONG BAI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
LIANG HE, School of Computer Science and Technology, East China Normal University, China
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at /YoungDubbyDu/LLM-Agent-Optimization.
1 Introduction
The development of autonomous agents has been a long-term pursuit in Artificial Intelligence (AI). AI agents have evolved from early rule-based and expert system-based architectures to reinforcement learning (RL)-driven agents, which are now widely applied in many fields [35]. Traditional RL-based agents optimize policies through interaction with environments, using structured reward functions to achieve goals and improve performance over time. However, these approaches often require extensive training, rely on well-defined state-action spaces, and struggle with generalization across diverse tasks.
∗Corresponding author.
Authors' Contact Information: Shangheng Du, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, dsh@.cn; Jiabao Zhao, School of Computer Science and Technology, Donghua University, Shanghai, China, jbzhao@; Jinxin Shi, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, jinxinshi@; Zhentao Xie, School of Computer Science and Technology, East China Normal University, Shanghai, China, ecnudavidtao@; Xin Jiang, School of Computer Science and Technology, East China Normal University, Shanghai, China, 51275901099@; Yanhong Bai, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, Lucky_Baiyh@; Liang He, School of Computer Science and Technology, East China Normal University, Shanghai, China, lhe@.
In recent years, Large Language Models (LLMs) such as GPT-4 [120], PaLM 2 [5], and DeepSeek-R1 [52] have achieved remarkable success, demonstrating exceptional capabilities in language understanding, reasoning, planning and complex decision-making. Building on these strengths, LLMs can serve as agents, providing a promising pathway to improve autonomous decision-making and achieve AGI [169]. Unlike conventional RL-based agents, which optimize explicit reward-driven policies, LLM-based agents operate through text-based instructions, prompt templates, and in-context learning (ICL), allowing greater flexibility and generalization. These agents leverage the comprehension and reasoning capabilities of LLMs to interact with environments through natural language, execute complex multi-step tasks, and dynamically adapt to evolving scenarios. Existing LLM agents utilize various methods such as task decomposition [64], self-reflection [133], memory augmentation [210], and multi-agent collaboration [86] to achieve high performance across a range of domains, including software development [67], mathematical reasoning [1], embodied intelligence [212], web navigation [28], and more.
However, despite their strengths, LLMs are not inherently designed for autonomous decision-making and long-term tasks. Their training objectives focus on next-token prediction rather than the reasoning, planning, or interactive learning required for agent-based tasks, so they lack explicit training on agent-centric tasks. As a result, deploying LLMs as agents in complex environments presents several key challenges: 1) LLM-based agents struggle with long-horizon planning and multi-step reasoning, as their generative content may lead to task inconsistencies or error accumulation over extended interactions. 2) Limited memory capacity in LLMs hinders agents from utilizing past experiences for reflection, leading to suboptimal decision-making and task performance. 3) The adaptability of LLM-based agents to novel environments is constrained, as they primarily rely on pre-trained knowledge or fixed contexts, limiting their ability to handle dynamic scenarios. These limitations are particularly evident in open-source LLMs, which lag behind proprietary models like GPT-4 in agent-specific capabilities. Additionally, the high cost and lack of transparency of closed-source LLMs highlight the need for optimizing open LLMs to enhance agent capabilities.
Existing techniques, such as supervised fine-tuning (SFT) [122] and reinforcement learning from human feedback (RLHF) [121], have made significant strides in improving LLM performance on instruction-following tasks, but they fail to fully address the challenges of decision-making, long-term planning, and adaptability for LLM-based agents. Optimizing LLM-based agents requires a broader understanding of dynamic environments and agent behaviors, which calls for specialized techniques that go beyond traditional LLM fine-tuning and prompt engineering methods. To address these challenges, numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks. These methods ensure that agents can generalize across diverse environments, refine strategies based on feedback, and efficiently utilize external resources such as tools, memory, and retrieval mechanisms.
In this paper, we provide a comprehensive survey on LLM-based agent optimization, systematically categorizing methods into parameter-driven and parameter-free optimization strategies. Our work focuses on the technical methodologies employed to optimize agent capabilities, such as agent tuning, RL, and other techniques that improve agent performance. Specifically, Parameter-driven Optimization refines LLM parameters to enhance agent performance. This category includes conventional fine-tuning approaches, covering key stages such as agent trajectory data construction and fine-tuning strategies. In addition, we explore RL-based optimization, which is divided into two distinct optimization directions: reward function-based methods leveraging traditional RL techniques like Actor-Critic [147] and Proximal Policy Optimization (PPO) [136], and preference alignment-based methods utilizing Direct Preference Optimization (DPO) [132] to align agent policies with human preferences or task-specific objectives. Finally, we discuss hybrid fine-tuning optimization strategies, an emerging area that combines SFT with RL to iteratively refine agent behavior. In contrast, we also briefly outline Parameter-free Optimization methods that focus on improving agent behavior without modifying model parameters. These methods leverage prompt engineering, in-context learning and retrieval-augmented generation (RAG), incorporating various types of information into prompts to guide agents' actions. They are categorized into feedback-based optimization, experience-based optimization, tool-based optimization, retrieval-augmented optimization, and multi-agent collaborative optimization.
Fig. 1. An Overview of the Paper Organization.
Comparison to related surveys. Despite the growing research interest in LLM-based agents, existing surveys primarily focus on general LLM optimization or specific agent abilities such as planning, memory, and role-playing, without treating LLM-based agent optimization as a distinct research area. Surveys on LLM optimization mainly cover fine-tuning [115, 122] and self-evolution approaches [150], but lack discussions on the specialized optimization required for agent capabilities. On the other hand, existing agent-related surveys generally categorize works based on architectural components such as planning [64], memory [210], or multi-agent coordination [86], rather than systematically summarizing the techniques dedicated to optimizing LLM-based agent behaviors and performance. In comparison, this work is the first survey on LLM-based agent optimization techniques, facilitating a clearer understanding and comparison of existing methods and providing directions for future research.
Scope and rationales. (1) We survey only LLM-based agent optimization algorithms that improve agent task performance, such as problem-solving and decision-making, covering parameter-driven and parameter-free approaches. We exclude works centered on general LLM efficiency, role-playing, or dialogue. (2) Our selection includes papers from AI and NLP conferences and journals, as well as recent high-impact preprints from arXiv, to ensure coverage of the latest advancements. (3) We focus on studies published since 2022 to reflect recent advancements in LLM-based agent optimization.
Organization of this survey. The schematic representation of this manuscript's layout can be found in Figure 1. Section 2 provides the background knowledge and related concepts. In Section 3, we systematically review parameter-driven optimization approaches that modify LLM parameters to enhance agent capabilities, categorizing them into three main strategies: fine-tuning-based optimization (§3.1), RL-based optimization (§3.2), and hybrid optimization (§3.3). Section 4 summarizes and classifies existing work on parameter-free optimization strategies. Then, Section 5 presents datasets and benchmarks, while Section 6 reviews practical applications across various domains. Finally, Section 7 highlights challenges and future directions.
2 Background
2.1 Reinforcement Learning-based Agent Optimization
RL has long been a fundamental approach in agent optimization, allowing agents to learn from interactions with environments. Current RL methods mainly optimize agent behaviors using value-based and policy-based approaches [35, 106, 117]. Value-based methods, such as Q-learning [25, 163], optimize an agent's action-value function to maximize long-term rewards. These methods are effective in discrete action spaces but struggle with high-dimensional states or action spaces. Policy-based methods, including Policy Gradient [48, 124], directly optimize the agent's policy by adjusting parameters based on reward gradients. To improve stability and sample efficiency, PPO [136] introduced a constraint on policy updates, mitigating performance degradation during training. Actor-Critic methods [147] combine value estimation with policy learning, improving convergence efficiency and decision robustness. Beyond single-agent settings, Multi-Agent Reinforcement Learning (MARL) extends RL techniques to scenarios involving multiple interacting agents, enabling both cooperative and competitive dynamics [12, 204].
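To make the clipped-update idea behind PPO concrete, the minimal sketch below computes the standard clipped surrogate loss on a batch of log-probabilities and advantage estimates; the tensor names, batch size, and clipping coefficient are illustrative placeholders rather than settings taken from any surveyed agent framework.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    logp_new:   log pi_theta(a|s) under the current policy
    logp_old:   log pi_theta_old(a|s) under the policy that collected the rollout
    advantages: estimated advantages A(s, a), e.g. from GAE
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximizing the surrogate objective means minimizing its negation.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for a rollout batch.
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
advantages = torch.randn(8)
print(ppo_clipped_loss(logp_new, logp_old, advantages))
```

The clamp prevents any single update from moving the policy too far from the data-collecting policy, which is the stability property the paragraph above refers to.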
In recent years, RL has also been increasingly applied to aligning AI agents with human intentions, particularly in preference-based optimization. RLHF [121] has emerged as a prominent approach, refining agent policies based on human-provided signals to improve alignment with desired behaviors. DPO [132] optimizes policies directly from preference data without reward modeling, improving alignment and controllability. Overall, RL-based optimization has evolved from early value-based and policy-based learning to more advanced techniques that integrate structured feedback and multi-agent coordination, providing a foundation for improving decision-making in LLM-based agents.
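As a rough illustration of how DPO removes the explicit reward model, the sketch below computes the DPO loss from per-sequence log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the policy and a frozen reference model; the variable names and the beta value are placeholders, not values prescribed by the works discussed here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is the summed log-probability of a full response
    (or agent trajectory) under the policy or the frozen reference model.
    """
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen trajectory to receive a higher implicit reward.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with a single preference pair.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)
```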
2.2 LLM Fine-Tuning
LLM fine-tuning is a critical method for adapting pre-trained models to specific tasks through optimizing parameters, making them more suited to the desired application. The most popular approach is SFT, where LLMs are trained on labeled data to improve task-specific performance. Instruction Tuning is a commonly used method in SFT, where LLMs are further trained on instruction-output pairs to enhance their ability to follow human commands [98, 205]. Another major development is parameter-efficient fine-tuning (PEFT), including methods like P-Tuning [103], LoRA [59], and QLoRA [30]. These techniques adjust a small subset of parameters, significantly reducing the computational cost of fine-tuning while preserving LLM performance, making them highly efficient for real-world applications. Additionally, RLHF has been used to fine-tune LLMs by integrating human feedback, improving their decision-making and output alignment with user preferences [121]. These optimization techniques enable LLMs to adapt more efficiently to a wide range of tasks, enhancing their effectiveness in real-world scenarios.
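For reference, a minimal LoRA setup with the Hugging Face peft library might look like the sketch below; the base checkpoint, target modules, and hyperparameters are illustrative assumptions and not settings recommended by the surveyed works.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters are injected into the attention projections;
# only the adapter weights are trained, the base model stays frozen.
lora_cfg = LoraConfig(
    r=8,                   # rank of the low-rank update matrices
    lora_alpha=16,         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# The wrapped model can then be passed to a standard SFT loop or
# transformers.Trainer in the same way as a fully fine-tuned model.
```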
2.3 LLM-based RAG
RAG combines LLMs with external information retrieval systems to enhance the relevance and accuracy of generated outputs. By retrieving relevant documents from external sources, RAG allows LLMs to address the knowledge constraints inherent in the models. The evolution of RAG methods has been marked by significant advancements in retrieval and generation integration [44]. Early Naive RAG methods focus on directly retrieving relevant documents to augment the generative process, improving the quality of responses in tasks requiring factual knowledge. To address the challenges of Naive RAG, Advanced RAG was introduced, refining the retrieval process by incorporating more effective ranking, filtering, and document selection strategies. Subsequently, Modular RAG introduces a modular framework that optimizes the retrieval and generative components independently. This modular approach enables task-specific optimizations, allowing for more flexibility and scalability in applications across different domains [8, 193]. These advancements in RAG highlight its potential to enhance LLMs by enabling dynamic access to external knowledge, making them more adaptable and capable of addressing complex tasks in real-world scenarios.
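To ground the basic retrieve-then-generate pattern, here is a minimal sketch of a Naive RAG step using cosine similarity over document embeddings; embed() and llm_generate() are hypothetical stand-ins for a real embedding model and a real LLM call.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding model; returns a deterministic unit-norm vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def llm_generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model endpoint."""
    return f"[answer conditioned on a prompt of {len(prompt)} characters]"

def naive_rag(question: str, docs: list[str], top_k: int = 2) -> str:
    # 1) Retrieve: rank documents by cosine similarity to the question.
    q = embed(question)
    scores = [float(q @ embed(d)) for d in docs]
    retrieved = [docs[i] for i in np.argsort(scores)[::-1][:top_k]]
    # 2) Generate: prepend the retrieved context to the prompt.
    context = "\n".join(retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_generate(prompt)

print(naive_rag("When was the Eiffel Tower completed?",
                ["The Eiffel Tower was completed in 1889.",
                 "Paris is the capital of France."]))
```

Advanced and Modular RAG variants differ mainly in how step 1 is refined (re-ranking, filtering, modular retrievers), while the generation step stays conceptually the same.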
3 Parameter-driven Optimization of LLM-based Agents
Comparison with LLM parameter optimization. Parameter-driven LLM optimization focuses on "how to create a better model", aiming to enhance general language understanding, instruction following, and broad task performance. In contrast, LLM-based agent parameter optimization addresses "how to use the model to solve complex agent tasks", emphasizing decision-making, multi-step reasoning, and task execution in dynamic environments. Although general LLM optimization improves fluency and factual accuracy across diverse applications, LLM-agent optimization is task-specific, requiring models to adapt strategies, interact with environments, and refine behaviors for autonomous problem-solving. Parameter-driven optimization of LLM-based agents primarily relies on expert trajectory data or self-generated trajectory data obtained through environment exploration, and then employs various optimization techniques to iteratively refine policies and enhance performance.
In this section, we discuss how parameter-driven optimization methods improve the performance of LLM-based agents. Specifically, we categorize these methods into three main technical approaches according to different strategies for parameter tuning: conventional fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid optimization.
3.1 Conventional Fine-Tuning-based Optimization
Conventional fine-tuning-based agent optimization involves tuning pre-trained LLMs' parameters through various fine-tuning techniques, such as instruction tuning and parameter-efficient fine-tuning. Trajectories for fine-tuning are typically constructed in the form of SFT data and are used to adjust the agent's parameters to better align with task-specific requirements. The optimization process typically consists of two major steps: 1) constructing high-quality trajectory data tailored to agent tasks; 2) fine-tuning LLM-based agents using these trajectory data; the complete process is presented in Figure 2. Previous studies [40, 83, 122] have shown that the quality of training data significantly impacts model performance, highlighting the importance of generating, filtering, and effectively utilizing high-quality trajectories. This makes trajectory construction a critical step in the fine-tuning pipeline, directly influencing the LLM-based agent's overall performance. In Table 1, we provide a comprehensive overview of fine-tuning-based agent optimization methods, highlighting the data processing techniques and fine-tuning strategies used in each work. It is important to note that this section excludes fine-tuning methods that involve reinforcement learning or preference alignment techniques (e.g., DPO, PPO), which will be addressed in §3.2. Instead, in this section, we focus only on the traditional LLM fine-tuning techniques applied in existing works, aiming to ensure each stage of the conventional fine-tuning-based agent optimization workflow is clearly introduced.
Fig. 2. Workflow of Fine-Tuning-based Optimization for LLM-based Agents.
3.1.1 Trajectory Data Construction for Agent Fine-Tuning. The construction of high-quality trajectories is a crucial step before the fine-tuning of LLM-based agents, which aims to empower LLMs with agent abilities. This process involves the generation of trajectory data, followed by evaluation and filtering, and the potential utilization of low-quality samples, to construct refined data that meet the requirements for effective fine-tuning.
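The generate-evaluate-filter loop described here can be sketched as follows; the Trajectory record, the reward threshold, and the idea of keeping rejected rollouts as negatives for later preference-style training are illustrative assumptions rather than a pipeline defined by any single surveyed paper.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list[dict] = field(default_factory=list)  # thought/action/observation triples
    reward: float = 0.0                              # e.g. task success signal from the environment

def filter_trajectories(trajs: list[Trajectory], threshold: float = 0.9):
    """Split raw rollouts into fine-tuning data and low-quality leftovers."""
    accepted, rejected = [], []
    for t in trajs:
        (accepted if t.reward >= threshold else rejected).append(t)
    return accepted, rejected

# Accepted trajectories become SFT examples; rejected ones can still be
# reused, e.g. as negative samples for contrastive or preference training.
raw = [Trajectory("book a flight", reward=1.0), Trajectory("book a flight", reward=0.2)]
sft_data, negatives = filter_trajectories(raw)
print(len(sft_data), "accepted /", len(negatives), "rejected")
```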
Data Acquisition and Generation. High-quality trajectory data construction begins with the acquisition and generation of initial data, which requires not only a diverse set of trajectories, but also sufficient alignment with the target tasks to ensure effective learning. Methods for acquiring and generating such data can generally be classified into four broad categories: expert-annotated data, strong LLM-generated trajectories, self-exploration environment-interaction trajectories, and multi-agent collaboration-based construction. Here, we introduce the utilization and construction processes of each category and review the relevant studies.
Table 1. Comparison of Conventional Fine-Tuning-based Optimization for LLM-based Agents: Data Construction and Fine-Tuning. Note: MA - Multi-Agent Framework; LQ - Low-Quality Data Utilization. The Generation, Filtering, MA, and LQ columns describe trajectory data construction; Fine-tune Approach and Base Model describe fine-tuning.

Method | Generation | Filtering | MA | LQ | Fine-tune Approach | Base Model
AgentTuning [199] | Strong LLM | Human or Rule | √ | √ | Instruction Tuning | Llama-2-7B/13B/70B
SMART [197] | Multi-agent | Environment | / | √ | LoRA | Llama-2-7B
Agent-FLAN [22] | Expert | Model | √ | √ | Instruction Tuning | Llama-2-7B
Self-Talk [153] | Multi-agent | Human or Rule | / | √ | LoRA | MosaicAI-7B-Chat
ENVISIONS [178] | Self-exploration | Environment | / | √ | SFT | Llama2-7B/13B-Chat
AgentGym [170] | Strong LLM & Expert | Environment | / | √ | BC | Llama-2-7B-Chat
FireAct [14] | Strong LLM | Environment | / | / | LoRA | GPT3.5, Llama-2-7B/13B, CodeLlama-7B/13B/34B-Instruct
NAT [158] | Strong LLM | Environment | / | √ | SFT | Llama-2-7B/13B-Chat
AgentLumos [192] | Strong LLM | Human or Rule | / | / | LoRA | Llama-2-7B/13B
STE [154] | Self-exploration | Model | / | √ | SFT | Llama-2-7B/13B-Chat, Mistral-7B-Instruct
OPTIMA [19] | Multi-agent | Human or Rule | √ | / | SFT | Llama-3-8B
Zhou et al. [216] | Strong LLM | Human or Rule | √ | / | LoRA | OpenChat v3.2, Llama-2-7B, AgentLM-7B
AgentOhana [202] | Expert | Model | / | / | QLoRA | xLAM-v0.1
COEVOL [85] | Expert | Model | √ | / | SFT | Llama-2-7B, Mistral-7B
AGENTBANK [143] | Strong LLM | Environment | / | √ | Instruction Tuning | Llama-2-Chat
ADASWITCH [146] | Self-exploration | Model | √ | √ | SFT | DeepSeek-Coder-1.3B, StarCoder2-3B
IPR [177] | Expert & Self-exploration | Environment | / | √ | Instruction Tuning | Llama-2-7B
Re-ReST [33] | Self-exploration | Environment | / | √ | LoRA | Llama-2-7B/13B, Llama-3-8B, CodeLlama-13B, VPGen
ATM [219] | Multi-agent | / | √ | / | MITO | Llama-2-7B
Aksitov et al. [3] | Self-exploration | Model-based | / | / | SFT | PaLM-2-base-series
SWIFTSAGE [94] | Self-exploration | Environment | √ | / | SFT | T5-Large
AGILE [39] | Expert | / | / | / | BC | Vicuna-13B, Meerkat-7B
NLRL [40] | Self-exploration | / | / | / | SFT | Llama-3.1-8B-Instruct
ETO [144] | Expert | / | / | √ | BC | Llama-2-7B-Chat
Retrospex [171] | Expert | / | / | √ | BC | Flan-T5-Large, Llama-3-8B-Instruct
ToRA [49] | Strong LLM | Human or Rule | / | √ | BC | Llama-2-series, CodeLlama-series
Sayself [179] | Strong LLM | Human or Rule | / | / | SFT | Mistral-7B, Llama-3-8B

(1) Expert-annotated data. Expert-annotated trajectories refer to high-quality datasets manually crafted by human experts, often considered the gold standard for fine-tuning. These data ensure task reliability and alignment, as experts can meticulously design and annotate trajectories tailored to specific cases.
Many works [14, 39, 144, 158, 177] utilize ReAct-style expert trajectories as initial datasets, with data including thoughts, observations and actions [189], which enable agents to mimic expert decision-making processes more effectively. For instance, IPR [177] leverages such trajectories to help agents acquire foundational skills. Similarly, ETO [144] and AGILE [39] apply Chain of Thought
(CoT) methods [164] to expert trajectories for imitation learning, reinforcing task-specific behaviors. To ensure alignment with pre-trained LLM domains, Agent-FLAN [22] transforms ReAct-style expert trajectories into multi-turn dialogues, segmenting the dialogue into different task-specific turns, such as instruction-following and reasoning. StepAgent [29] introduces a two-phase learning process, where agents first observe discrepancies between their policies and expert trajectories, then iteratively refine their actions. Additionally, AgentOhana [202] standardizes heterogeneous agent expert trajectories into a unified format to improve data consistency. Despite their reliability and alignment with specific tasks, these datasets are resource-intensive and lack scalability, making them commonly supplemented with other data acquisition methods to enhance dataset diversity.
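For illustration, converting one ReAct-style expert trajectory into the multi-turn chat format used for supervised fine-tuning might look like the sketch below; the message schema and field names are assumptions for exposition, not the exact formats used by Agent-FLAN or AgentOhana.

```python
# One ReAct-style expert trajectory: alternating thought/action/observation steps.
trajectory = {
    "task": "Find the population of Paris and report it.",
    "steps": [
        {"thought": "I should search for the population of Paris.",
         "action": "search[population of Paris]",
         "observation": "Paris has about 2.1 million inhabitants."},
        {"thought": "I now know the answer.",
         "action": "finish[about 2.1 million]",
         "observation": ""},
    ],
}

def to_chat_sft(traj: dict) -> list[dict]:
    """Flatten a trajectory into chat messages for instruction tuning."""
    messages = [{"role": "user", "content": traj["task"]}]
    for step in traj["steps"]:
        # The assistant turn carries the reasoning and the chosen action...
        messages.append({"role": "assistant",
                         "content": f"Thought: {step['thought']}\nAction: {step['action']}"})
        # ...and the environment feedback is fed back as the next user turn.
        if step["observation"]:
            messages.append({"role": "user",
                             "content": f"Observation: {step['observation']}"})
    return messages

for m in to_chat_sft(trajectory):
    print(m["role"], ":", m["content"].splitlines()[0])
```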
(2) Strong LLM-generated trajectories. Strong LLM-generated trajectories leverage powerful LLMs like ChatGPT and GPT-4 to autonomously generate task-specific data. These trajectories are usually produced by reasoning frameworks such as ReAct and CoT, allowing the model to interact with the environment and simulate processes of reasoning, decision-making and acting.
AgentTuning [199] and FireAct [14] employ ReAct and CoT to guide agent behavior while incorporating Reflexion [139] refinements, improving the diversity of generated data. Some works integrate tools and structured annotations to enhance trajectory informativeness. NAT [158] generates multiple trajectories under different temperature settings, using ReAct prompts and integrating tools such as calculators and APIs during interactions. AgentLumos [192] utilizes GPT-4 and GPT-4V to annotate datasets within planning and grounding modules, producing LUMOS-I and LUMOS-O style data. Other methods explore multi-role simulation to enrich trajectory complexity. Zhou et al. [216] employ GPT-4 to simulate problem generators, action planners, and environment agents, enabling iterative interaction-driven data generation. AGENTBANK [143] also leverages GPT-4 for environment interaction data and GPT-3.5 for CoT rationales, and finally transforms the data into chatbot-style formats for improved usability.
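Sampling several candidate trajectories per task at different temperatures, as in the NAT-style setup described above, can be sketched roughly as follows; generate_trajectory() is a hypothetical stand-in for a call to a strong LLM with a ReAct prompt, and the success criterion is a placeholder for a real environment or model-based check.

```python
import random

def generate_trajectory(task: str, temperature: float) -> dict:
    """Hypothetical strong-LLM rollout; replace the body with a real API call.

    Here we fake a rollout whose success loosely depends on temperature,
    purely so the sketch runs end to end.
    """
    success = random.random() < max(0.2, 1.0 - temperature / 2)
    return {"task": task, "temperature": temperature, "success": success,
            "steps": ["Thought: ...", "Action: ...", "Observation: ..."]}

def sample_trajectories(task: str, temperatures=(0.0, 0.5, 1.0), n_per_temp=2):
    """Collect candidate rollouts across temperatures and keep both
    successful and failed ones (failures can serve as negative samples)."""
    rollouts = [generate_trajectory(task, t)
                for t in temperatures for _ in range(n_per_temp)]
    positives = [r for r in rollouts if r["success"]]
    negatives = [r for r in rollouts if not r["success"]]
    return positives, negatives

pos, neg = sample_trajectories("Answer a multi-hop question with a search tool")
print(len(pos), "successful /", len(neg), "failed rollouts")
```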
(3) Self-exploration environment-interaction trajectories. Given the high costs of expert annotation and p