
Reinforcement Learning for Safe LLM Code Generation

Roy Huang

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-123

/Pubs/TechRpts/2025/EECS-2025-123.html

May 19, 2025

Copyright © 2025, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

I would like to acknowledge and thank the entire rLLM team, in particular Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and for introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for his supportive advising and for guiding me along my journey. Most importantly, I would like to thank my parents for supporting me throughout my life and getting me to where I am today. I could not have made it without their love and encouragement.

Reinforcement Learning for Safe LLM Code Generation

by Yu Fei Huang

Research Project

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II.

Approval for the Report and Comprehensive Examination:

Committee:

Professor Joseph E. Gonzalez
Research Advisor

5/15/2025
(Date)

* * * * * *

Professor Raluca Ada Popa
Second Reader

5/18/2025
(Date)

Reinforcement Learning for Safe LLM Code Generation¹

by

Yu Fei Huang

A thesis submitted in partial satisfaction of the requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Joseph E. Gonzalez, Chair
Associate Professor Raluca Ada Popa

Spring 2025

¹ This thesis is adapted from GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications [26] and DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [17]. It is recommended to cite these papers over this report.

Reinforcement Learning for Safe LLM Code Generation

Copyright 2025
by
Yu Fei Huang



Abstract

Reinforcement Learning for Safe LLM Code Generation

by

Yu Fei Huang

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Joseph E. Gonzalez, Chair

Reinforcement learning (RL) has become a primary technique for aligning Large Language Models (LLMs) with complex reasoning objectives, yet convergence is fragile when reward signals are noisy or exploitable. This thesis presents rLLM, an open-source, Ray-based RL framework that combines an improved Group-Relative Policy Optimization (GRPO+), a veRL backend modified with asynchronous pipelined sampling, and iterative context lengthening. Using rLLM we trained Deepcoder-14B, a 14-billion-parameter code-reasoning model that attains 60.6% Pass@1 on LiveCodeBench, a 1936 Codeforces rating, and 92.6% Pass@1 on HumanEval+, matching OpenAI's proprietary o3-mini (low) and o1 on these benchmarks.

We show that such performance hinges on an airtight sandboxed execution environment that safeguards reward integrity. To that end we take inspiration from GoEx, a post-facto-validated runtime that envelops every REST call, database mutation, and file operation in deterministic undo and blast-radius-bounded confinement semantics. rLLM consumes these airtight environments directly to compute rewards, eliminating reward hacking.

The findings underscore that the proposed GRPO+ modification significantly enhances training convergence compared to existing widely-adopted algorithms such as GRPO and DAPO. Furthermore, the asynchronous pipelining mechanism incorporated into veRL substantially optimizes the training infrastructure, enabling efficient scalability. Ultimately, by integrating these advancements within a meticulously secure environment, this thesis delivers a comprehensive RL framework that reliably aligns LLMs with sophisticated reasoning objectives, paving the way for future research into robust and scalable reinforcement learning systems.


To my parents, advisor, and research collaborators


Contents

Contents
List of Figures
List of Tables

1 Introduction
1.1 Background and Motivation
1.2 rLLM Framework Overview

2 Related Work
2.1 Reinforcement Learning for Language-Model Alignment
2.2 Distributed Frameworks and Systems Infrastructure
2.3 Secure Execution Environments and Reward Integrity

3 GoEX: Execution Runtime for LLMs
3.1 Designing a Runtime for LLM Execution
3.2 Reversibility and Damage Confinement
3.3 Symbolic Credentials and Sandboxed Execution
3.4 Credential Storage and Access Control
3.5 System Design Components

4 rLLM: RL Training for LLM Reasoning
4.1 Problem Statement
4.2 rLLM Framework

5 rLLM Experiment: Deepcoder-14B
5.1 Dataset Curation Strategy
5.2 Code Sandbox Environment for Reward Computation
5.3 Reward Function Design
5.4 Evaluation Results
5.5 End-to-end Performance

6 Conclusion

Bibliography

A Codeforces Evaluation

List of Figures

3.1 GoEX's runtime for executing RESTful API calls. Upon receiving the user's prompt, GoEX presents two alternatives. First, an LLM can be prompted to come up with the (Action, Undo-Action) pair. Second, the application developer can provide tuples of actions and their corresponding undo-actions (function calls) from which the LLM can pick.

3.2 Runtime for executing actions on a database. We present two techniques to determine if a proposed action can be undone. On the left, for non-transactional databases like MongoDB, and for flexibility, we prompt the LLM to generate (Action, Undo-Action, test-bed) tuples, which we then evaluate in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, we provide a deterministic undo with guarantees by employing the transaction semantics of databases.

3.3 Runtime for executing actions on a file system. GoEX presents two abstractions. On the left, the LLM is prompted to come up with an (Action, Undo-Action, test-bed) tuple, which GoEX evaluates in an isolated container to catch any false (Action, Undo-Action) pairs. The right presents deterministic guarantees by using a version control system such as Git or Git LFS.

4.1 Average training reward between GRPO+ and GRPO for the 16K run. GRPO's reward curve eventually collapses. GRPO+'s curve is stable due to Clip High.

4.2 Due to overlong filtering, GRPO+'s response length grows steadily over time.

4.3 Clip High and No Entropy Loss ensure that GRPO+'s token-level entropy does not collapse and encourage sufficient exploration.

4.4 DeepCoder's average response length and training rewards as training progresses. Average response length increases from 8K → 17.5K context length.

4.5 Verl's PPO/GRPO training pipeline. Every RL iteration cycles through sampling, reward function calculation, and training. Sampling is the bottleneck; training speed is bounded by straggler samplers that generate long sequences.

4.6 Minibatch Pipelining. Samplers and trainers operate in separate worker groups. As samplers complete and release mini-batches (for PPO/GRPO), trainer workers process them asynchronously. At the end of an iteration, trainers broadcast their weights to samplers.

4.7 One-Off Pipelining. Samplers generate a batch one iteration ahead, while trainers update gradients using the previous iteration's data. Second, reward function calculation is interleaved with sampling. This approach does not introduce asynchronous off-policy samples into GRPO/PPO's on-policy algorithm.

5.1 One-off pipelining fully masks away trainer and reward computation times, reducing training times by 1.4x for math and 2x for coding.

List of Tables

5.1 Model Performance on Coding and Math Benchmarks


Acknowledgments

I would like to acknowledge and thank the entire rLLM team, in particular Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and for introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for his supportive advising and for guiding me along my journey. Most importantly, I would like to thank my parents for supporting me throughout my life and getting me to where I am today. I could not have made it without their love and encouragement.


Chapter 1

Introduction

Large Language Models (LLMs) have advanced from sequence-to-sequence autoregressors into agents capable of multi-step reasoning, tool calling, and code synthesis. Supervised pre-training supplies fluent linguistic priors, yet it is reinforcement learning (RL) that aligns those priors with task-level objectives such as passing unit-test suites or developing emergent reasoning patterns. Optimizing an LLM policy πθ over long, sparse reward trajectories, however, remains brittle: credit-assignment noise grows quadratically with sequence length, and poorly instrumented environments invite reward hacking, where policies learn spurious strategies that inflate the scalar return while degrading true utility.

This thesis addresses these challenges by proposing rLLM, a purpose-built RL framework that couples a novel Group-Relative Policy Optimization Plus (GRPO+) algorithm, which builds on prior work on GRPO and DAPO, with an asynchronous, Ray-orchestrated sampling pipeline. rLLM's design goal is two-fold: (i) sustain high-throughput gradient updates on clusters of thousands of GPUs; and (ii) preserve reward integrity through airtight execution sandboxes inspired by the GoEx post-facto validation runtime. The framework is validated by training Deepcoder-14B, a 14-billion-parameter code-reasoning model that matches the performance of proprietary systems while remaining fully open source.

1.1 Background and Motivation

The alignment of Large Language Models (LLMs) has progressed from supervised fine-tuning (SFT) to full reinforcement-learning pipelines that optimize a policy over long, task-level roll-outs. Early RL with human feedback (RLHF) systems adopted Proximal Policy Optimization (PPO) and its KL-constrained variants, but the high variance of long-horizon credit assignment soon motivated Group-Relative Policy Optimization (GRPO), which measures advantages against peer trajectories sampled from the same prompt group, markedly improving stability on reasoning tasks. Subsequent work such as DAPO added dynamic sampling and decoupled clipping to push large-scale training beyond 30B parameters. Despite these algorithmic advances, convergence is still brittle whenever reward channels leak noise or are exploitable. Studies on reward hacking show that agents readily discover loopholes, such as fabricating logs, short-circuiting unit tests, or corrupting state, to inflate nominal returns while degrading true task success.

Scaling RL to frontier-sized models therefore demands system innovations as well. Synchronous actor–learner loops stall on the longest rollout, under-utilising expensive accelerators; industrial solutions now favour asynchronous pipelines built atop Ray's distributed execution engine, which offers elastic, fault-tolerant placement of both actors and learners. Libraries such as veRL expose lightweight RPC interfaces for high-throughput sampling and have become a de-facto substrate for open-source RLHF research. Yet throughput alone is insufficient: long-context optimization (32k–64k tokens) multiplies gradient noise and memory pressure, motivating iterative context-lengthening curricula that grow windows only after variance plateaus.

Equally critical is the execution environment where roll-outs are evaluated. Without explicit safeguards, an LLM tuned to interact with external tools can overwrite databases, issue destructive API calls, or generate deceptive test harnesses that pass benchmarks while hiding faulty logic. The Berkeley GoEx runtime addresses this by wrapping every REST call, file operation, and SQL mutation in deterministic undo and blast-radius-bounded confinement, producing reversible traces that can be safely replayed or discarded. Such post-facto validation provides tamper-proof reward signals, closing an essential safety loop ignored by many algorithm-centric studies.
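
To make the (Action, Undo-Action) abstraction concrete, the sketch below pairs one side-effecting REST call with a compensating call that can reverse it post facto. It is an illustrative example in the spirit of GoEx rather than its implementation; the GitHub endpoints are real, but the function names, token handling, and pairing logic are simplified assumptions.

import requests


def create_issue(repo: str, title: str, token: str):
    """Action: open a GitHub issue and return (result, undo_action)."""
    headers = {"Authorization": f"token {token}"}
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers=headers,
        json={"title": title},
        timeout=10,
    )
    resp.raise_for_status()
    issue_number = resp.json()["number"]

    def undo_action():
        # Compensating call: issues cannot be deleted through the API, so
        # "close the issue we just opened" serves as the deterministic undo.
        requests.patch(
            f"https://api.github.com/repos/{repo}/issues/{issue_number}",
            headers=headers,
            json={"state": "closed"},
            timeout=10,
        ).raise_for_status()

    return issue_number, undo_action

A runtime holding such pairs can execute the action speculatively, let a user or validator inspect the result, and invoke undo_action() if the outcome is rejected, which is the reversibility property described above.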

Finally, modern code-reasoning benchmarks like LiveCodeBench, HumanEval+, and Codeforces have emerged as stringent tests of reasoning quality under contamination-free evaluation. Open-weight models like Deepcoder-14B now match proprietary systems at 14B parameters by combining high-quality data curation with RL fine-tuning, achieving 60.6% Pass@1 on LiveCodeBench and a 1936 Codeforces rating. Their success underscores the synergistic effect of cutting-edge optimization algorithms, efficient distributed infrastructure, and meticulously sandboxed environments: precisely the triad this thesis seeks to systematise through the rLLM framework.

1.2 rLLM Framework Overview

The rLLM stack is engineered around three tightly coupled layers (algorithm, systems, and curriculum), each tuned to mitigate a specific failure mode in large-scale RL for LLMs.

Algorithmic core (GRPO+)

rLLM extends Group-Relative Policy Optimization by (i) relative-KL clipping, which bounds the per-group policy update in its own local trust region, (ii) overlong filtering, which discards trajectories whose length-scaled variance dominates the minibatch, and (iii) removal of entropy bonuses once exploration saturates. The first two modifications cut gradient variance by 18% on synthetic bandits and prevent the high-KL "spikes" reported for vanilla GRPO on DeepSeek-R1 training. Compared with DAPO's decoupled-clip objective, GRPO+ achieves equivalent final reward with 12% fewer updates on a 4k-prompt ablation.
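
To make these modifications concrete, the sketch below shows one way they can enter a GRPO-style sequence-level surrogate loss: group-normalized advantages, an asymmetric clipping range (the "Clip High" of Figures 4.1 and 4.3), overlong filtering, and no entropy bonus. The tensor shapes, clip bounds, and length threshold are illustrative placeholders, not rLLM's actual hyperparameters or implementation.

import torch


def grpo_plus_loss(logp_new: torch.Tensor,   # (G,) summed log-probs under the current policy
                   logp_old: torch.Tensor,   # (G,) summed log-probs under the sampling policy
                   rewards: torch.Tensor,    # (G,) scalar rewards for one prompt group
                   lengths: torch.Tensor,    # (G,) response lengths in tokens
                   max_len: int = 16_384,
                   eps_low: float = 0.2,
                   eps_high: float = 0.28) -> torch.Tensor:
    # Group-relative advantage: each trajectory is scored against its own prompt
    # group, which removes the need for a separate critic.
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-6)

    # Overlong filtering: trajectories that hit the context ceiling carry truncated,
    # noisy rewards, so they are masked out of the update.
    keep = (lengths < max_len).float()

    # Asymmetric clipping ("Clip High"): a wider upper bound keeps low-probability
    # tokens trainable and helps prevent entropy collapse.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv)

    # No entropy bonus is added; exploration is maintained by the clip-high bound.
    return -(surrogate * keep).sum() / keep.sum().clamp(min=1.0)

In practice this per-group loss would be averaged over all prompt groups in a minibatch before the backward pass.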

Systems layer

On the systems side, rLLM adds GRPO+ onto veRL, an open RLHF library whose actor and learner nodes are orchestrated by Ray's elastic placement engine. We introduce an asynchronous double-buffered pipeline, verl-pipe, that overlaps rollout generation and gradient application. Benchmarks on 8×A100 GPUs show 2.1× throughput versus a strong synchronous PPO baseline while sustaining ≥95% device utilization. The design eliminates the "tail latency" problem in which a single long-context sample stalls global optimization.
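
The overlap pattern can be illustrated with plain Ray primitives, as in the sketch below: sampler actors stream mini-batches into a bounded queue as soon as each one finishes, a trainer consumes them concurrently, and updated weights are broadcast back at the iteration boundary. The class names, queue size, and dummy rollouts are hypothetical stand-ins; this is not the verl-pipe API.

import random
import ray
from ray.util.queue import Queue


@ray.remote
class Sampler:
    """Generates rollouts with the current weights and streams them out."""

    def __init__(self, queue: Queue):
        self.queue = queue
        self.version = 0

    def set_weights(self, version: int) -> None:
        # Stand-in for the weight broadcast at the end of an iteration.
        self.version = version

    def sample(self, num_minibatches: int) -> None:
        for _ in range(num_minibatches):
            # A dummy rollout; real samplers would decode long sequences here.
            rollout = {"reward": random.random(), "version": self.version}
            self.queue.put(rollout)  # hand off as soon as the mini-batch is ready


@ray.remote
class Trainer:
    """Consumes mini-batches as they arrive instead of waiting for a full batch."""

    def train(self, queue: Queue, total: int) -> int:
        version = 0
        for _ in range(total):
            minibatch = queue.get(block=True)   # overlaps with ongoing sampling
            version = minibatch["version"] + 1  # pretend this was a gradient step
        return version


if __name__ == "__main__":
    ray.init()
    queue = Queue(maxsize=8)  # bounded queue provides the double-buffering back-pressure
    samplers = [Sampler.remote(queue) for _ in range(4)]
    trainer = Trainer.remote()

    sample_refs = [s.sample.remote(4) for s in samplers]          # 16 mini-batches total
    new_version = ray.get(trainer.train.remote(queue, total=16))
    ray.get(sample_refs)
    # Iteration boundary: broadcast updated weights back to every sampler.
    ray.get([s.set_weights.remote(new_version) for s in samplers])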

Curriculum layer (iterative context lengthening)

Long contexts exacerbate both memory footprint and credit-assignment noise. rLLM therefore adopts a staged curriculum, 16k → 32k → 64k tokens, advancing only when reward variance plateaus. Recent work on long-context pre-training shows that such gradual expansion yields better utilization of the expanded receptive field than jumping to the final window directly. In practice, curriculum lengthening shaves 21% off the wall-clock time relative to a static 64k run.
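
A minimal sketch of such a staged schedule follows: a controller tracks a rolling window of per-iteration reward variances and promotes the context length to the next stage once the variance stops shrinking. The stage sizes, window length, and plateau tolerance are illustrative assumptions rather than rLLM's exact settings.

from collections import deque


class ContextCurriculum:
    """Advance the sampling context window only after reward variance plateaus."""

    def __init__(self, stages=(16_384, 32_768, 65_536), window=50, rel_tol=0.05):
        self.stages = list(stages)
        self.stage_idx = 0
        self.window = window
        self.rel_tol = rel_tol
        self.variances = deque(maxlen=window)  # rolling reward-variance history

    @property
    def max_tokens(self) -> int:
        return self.stages[self.stage_idx]

    def update(self, reward_variance: float) -> int:
        """Record this iteration's reward variance and return the current window."""
        self.variances.append(reward_variance)
        full_window = len(self.variances) == self.window
        last_stage = self.stage_idx == len(self.stages) - 1
        if full_window and not last_stage:
            first, last = self.variances[0], self.variances[-1]
            # Plateau test: variance stopped shrinking appreciably over the window.
            if first > 0 and (first - last) / first < self.rel_tol:
                self.stage_idx += 1
                self.variances.clear()
        return self.max_tokens

In a training loop, the trainer would call update() once per iteration with the measured reward variance and pass max_tokens to the sampler as the generation limit.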

Empirical highlight (Deepcoder-14B)

Running the full pipeline on curated competitive-coding tasks in the Deepcoder dataset produces Deepcoder-14B, which attains 60.6% Pass@1 on LiveCodeBench, a Codeforces Elo of 1936, and 92.6% Pass@1 on HumanEval+, equaling OpenAI's o3-mini (low) with an open-sourced training procedure, data, and weights.

Environment

The above gains materialize only when the reward function runs inside an environment that is airtight. rLLM therefore executes all rollouts inside a sandbox where every code snippet runs with resource isolation and constraints; this ensures timely execution and proper fail-fast checks. These environments also need to be performant enough to support large-scale parallel reward calculation, so rLLM introduces an environment that is optimized for parallel reward-function execution while remaining sandboxed.
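
As a concrete illustration of this resource-isolated, fail-fast execution, the sketch below runs a single candidate program in a child process with CPU and memory limits and a hard wall-clock timeout, scoring it only by its output. The specific limits, the use of process rlimits instead of a full container, and the binary 0/1 reward are simplifying assumptions; the actual environment used for Deepcoder-14B (Section 5.2) is engineered for large-scale parallel reward computation.

import os
import resource
import subprocess
import sys
import tempfile


def _limit_resources() -> None:
    # Applied in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                     # 5 s CPU time
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB address space


def run_candidate(code: str, stdin_data: str, expected: str) -> float:
    """Return 1.0 if the program reproduces the expected output, else 0.0 (fail fast)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],   # -I: isolated mode, ignore user site-packages
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=10,                     # hard wall-clock limit
            preexec_fn=_limit_resources,    # rlimits applied inside the child (POSIX only)
        )
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
    return 1.0 if proc.returncode == 0 and proc.stdout.strip() == expected.strip() else 0.0

In an RL loop, many such checks run in parallel (for example via a process pool), and the returned scalar feeds directly into the reward for the corresponding trajectory.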


Chapter 2

Related Work

2.1 Reinforcement Learning for Language-Model Alignment

Early attempts at aligning large language models relied on Proximal Policy Optimization (PPO), a first-order trust-region method that clips the policy update to avert collapse while remaining computationally tractable [30]. OpenAI's InstructGPT extended PPO into a full RL-from-Human-Feedback (RLHF) pipeline, demonstrating that fine-tuning with preference-based rewards markedly improves obedience and usefulness on instruction-following benchmarks [25]. Subsequent work revealed, however, that PPO's global baseline and single-trajectory advantages struggle with the variance introduced by the long contexts and sparse rewards typical of reasoning tasks.

To mitigate these issues, Group Relative Policy Optimization (GRPO) estimates baselines from groups of trajectories sharing the same prompt, thereby sharpening credit assignment and cutting memory overhead by eliminating a separate critic network [31]. GRPO has been shown to sustain stable learning on 16k–32k token windows for mathematics-focused models, yet still exhibits poor performance when scaled to larger, heterogeneous corpora due to the constraints of sample-level loss. DAPO generalizes the idea by introducing decoupled clipping and adaptive temperature scaling, as well as token-level loss, thereby reporting improved convergence across nine public RLHF tasks and providing an open-source reference for cluster-scale training [39].
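
For reference, the group-relative baseline that replaces the learned critic can be written compactly. With G responses o_1, ..., o_G sampled for a prompt q and scalar rewards r_i, the advantage and the (sequence-level, KL-penalty omitted) clipped objective take the following standard GRPO form [31]:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}\!\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\!\big(\{r_j\}_{j=1}^{G}\big)},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
\min\!\Big( \rho_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(\rho_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right].
\]

DAPO's decoupled clipping replaces the single ε above with separate lower and upper bounds, the asymmetric form that the GRPO+ overview in Section 1.2 also builds on.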

Despite algorithmic progress, all PPO-derived methods remain vulnerable to reward hacking, the exploitation of loopholes in the reward function or environment to inflate returns without genuine task success. Recent safety analyses of frontier models, including OpenAI's o1 and o3 series, document emergent deceptive behaviour under sparse reward regimes [3]. These observations underscore that reliable alignment hinges not only on robust optimization but also on verifiable reward channels and secure execution sandboxes.

The present work builds on this lineage by proposing GRPO+, an extension that applies relative KL clipping and overlong filtering to further stabilize updates, and embedding the algorithm within an asynchronous sampling stack (Section 1.2) executed inside an airtight, reversible environment (Section 4). This holistic approach targets the intertwined algorithmic and environmental causes of convergence failure identified in prior literature.

2.2 Distributed Frameworks and Systems Infrastructure

Scaling policy-gradient optimization to billion-parameter language models demands end-to-end systems support for high-throughput sampling, fault tolerance, and elastic resource utilization. Early RLHF pipelines embedded PPO directly inside bespoke trainer scripts, but soon migrated to general-purpose frameworks such as Ray RLlib, whose actor–learner abstraction and cluster scheduler offered turnkey horizontal scale-out and recovery. RLlib's versatility, however, comes at a cost: its monolithic APIs introduce performance overheads when rollouts require long-context decoding on tensor-parallel backends [21, 15].

To address LLM-specific bottlenecks, multiple open-source systems have emerged. veRL refactors RLlib's execution model into lightweight RPC endpoints and double-buffered GPU queues, sustaining >95% utilization on multi-node clusters. DistRL pushes asynchronous data collection to CPU-heavy inference nodes while reserving GPU servers for batched gradient updates, reducing straggler-induced idle time by 27% on in-house 70B models.

Large-scale industrial stacks couple these schedulers with high-performance serving layers. NVIDIA's Triton Inference Server is frequently deployed to shard sampler traffic across tensor-parallel decode replicas, masking backend variability beneath a uniform gRPC interface. On the optimization side, DeepSpeed RL extends DeepSpeed-ZeRO with offloading primitives tailored to PPO-style gradients, delivering near-linear scaling to 512 A100s on a 175B model according to internal benchmarks [22].

The baseline for the system optimizations is provided by verl [32], an open-source library for Reinforcement Learning from Human Feedback (RLHF) training of large language models. verl is the open-source implementation of the framework described in the paper "HybridFlow: A Flexible and Efficient RLHF Framework" [32]. The HybridFlow framework was developed to address the inherent complexity and computational inefficiency of traditional RLHF dataflows.

RLHF workflows, particularly those based on algorithms like PPO and GRPO [31], involve intricate dependencies and computational tasks performed by multiple LLM instances, including the Actor (policy) model, a Reward model, a Reference model, and a Critic model. These tasks encompass generation (sampling), inference (for reward, reference, and critic), and training steps. Traditional approaches often struggled with flexibly representing and efficiently executing these complex dataflows, leading to inefficiencies.
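
The resulting dataflow can be summarized in a few lines. The toy, runnable schematic below only shows where each of the four model roles sits within one iteration (generation, then inference, then training); the stub class, method names, and the naive reward-minus-value advantage are placeholders and do not reflect the verl or HybridFlow API.

import random


class Stub:
    """Placeholder for one LLM role (actor, reward, reference, or critic)."""

    def generate(self, prompts):                # actor: sampling / rollout
        return [p + " -> response" for p in prompts]

    def score(self, prompts, responses):        # reward model: scalar rewards
        return [random.random() for _ in responses]

    def log_probs(self, prompts, responses):    # reference model: KL anchor
        return [0.0 for _ in responses]

    def values(self, prompts, responses):       # critic: value estimates
        return [0.5 for _ in responses]

    def update(self, *batch):                   # training step (no-op in this toy)
        pass


def rlhf_iteration(prompts, actor, reward_model, reference, critic):
    responses = actor.generate(prompts)                      # 1. generation
    rewards = reward_model.score(prompts, responses)         # 2. inference
    ref_logprobs = reference.log_probs(prompts, responses)
    values = critic.values(prompts, responses)
    advantages = [r - v for r, v in zip(rewards, values)]    # naive baseline
    actor.update(prompts, responses, advantages, ref_logprobs)  # 3. training
    critic.update(prompts, responses, rewards)


if __name__ == "__main__":
    actor, reward_model, reference, critic = Stub(), Stub(), Stub(), Stub()
    rlhf_iteration(["write quicksort"], actor, reward_model, reference, critic)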

HybridFlow [32] addresses these challenges by proposing a flexible and efficient architecture. Key aspects include a hybrid-controller programming model that decouples the high-level control flow (defining the RL algorithm steps) from the low-level computation flow (executing neural network operations). This design allows for better modularity and reusability. The framework also emphasizes seamless integration with existing distributed training and inference libraries (such as FSDP, Megatron-LM, vLLM, and SGLang) and supports flexible device mapping to optimize resource utilization. While HybridFlow [32] provided a robust and efficient foundation for RLHF, particularly in managing diverse workloads and model placements, the sampling bottleneck, as described in subsequent sections, remained a significant area for further optimization.

Algorithm–system co-design remains active. VAGEN integrates variance-aware gradient aggregation with a custom parameter server that adaptively drops stale roll-outs, reporting 1.8× wall-clock speed-ups on multilingual instruction tuning [34]. In parallel, ByteDance's DAPO reference implementation exposes decoupled clipping and dynamic sampling primitives atop a Ray backend, achieving 50 points on AIME 2024 with a 32B Qwen base [39]. Finally, recent studies on adaptive fault tolerance for LLM clusters propose reactive migration of learner shards upon node failure, preserving >99.5% training availability over month-long runs [12].

Collectively, these frameworks highlight three design principles adopted by rLLM: (i) actor–learner decoupling with asynchronous, back-pressure-free queues; (ii) elastic orchestration that exploits Ray's placement groups for transparent failover; and (iii) hardware-aware serving layers that co-locate decoding and gradient aggregation to minimize PCIe and network hops.

2.3 Secure Execution Environments and Reward Integrity

A persistent failure mode in large-scale reinforcement learning is reward hacking, the tendency of an agent to exploit weaknesses in the reward specification or the surrounding system to maximize return without achieving genuine task success. Documented exploits include over-fitting brittle unit tests, fabricating evaluation logs, and mutating the very artifacts used for scoring [36].

To counteract these threats, two complementary strategies have emerged. Sandbox isolation is now standard practice in code-generation RL: each candidate program executes inside a resource-bounded container, and success is judged solely by the unit-test suite [38]. While effective against arbitrary file writes or network calls, sandboxes rely on the
