
Reinforcement Learning for Safe LLM Code Generation

Roy Huang

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-123

/Pubs/TechRpts/2025/EECS-2025-123.html

May 19, 2025

Copyright © 2025, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

I would like to acknowledge and thank the entire rLLM team, in particular Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and for introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for his supportive advising and for guiding me along my journey. Most importantly, I would like to thank my parents for supporting me throughout my life and getting me to where I am today. I could not have made it without their love and encouragement.

Reinforcement Learning for Safe LLM Code Generation

by Yu Fei Huang

Research Project

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II.

Approval for the Report and Comprehensive Examination:

Committee:

Professor Joseph E. Gonzalez
Research Advisor

5/15/2025
(Date)

* * * * * *

Professor Raluca Ada Popa
Second Reader

5/18/2025
(Date)

Reinforcement Learning for Safe LLM Code Generation¹

by

Yu Fei Huang

A thesis submitted in partial satisfaction of the requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Joseph E. Gonzalez, Chair
Associate Professor Raluca Ada Popa

Spring 2025

¹ This thesis is adapted from GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications [26] and DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [17]. It is recommended to cite these papers over this report.

Reinforcement Learning for Safe LLM Code Generation

Copyright 2025
by
Yu Fei Huang



Abstract

Reinforcement Learning for Safe LLM Code Generation

by

Yu Fei Huang

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Joseph E. Gonzalez, Chair

Reinforcement learning (RL) has become a primary technique for aligning Large Language Models (LLMs) with complex reasoning objectives, yet convergence is fragile when reward signals are noisy or exploitable. This thesis presents rLLM, an open-source, Ray-based RL framework that combines an improved Group-Relative Policy Optimization (GRPO+), a veRL backend modified with asynchronous pipelined sampling, and iterative context lengthening. Using rLLM we trained Deepcoder-14B, a 14-billion-parameter code-reasoning model that attains 60.6% Pass@1 on LiveCodeBench, a 1936 Codeforces rating, and 92.6% Pass@1 on HumanEval+, matching OpenAI's proprietary o3-mini (low) and o1 on these benchmarks.

We show that such performance hinges on an airtight sandboxed execution environment that safeguards reward integrity. To that end we take inspiration from GoEx, a post-facto-validated runtime that envelops every REST call, database mutation, and file operation in deterministic undo and blast-radius-bounded confinement semantics. rLLM consumes these airtight environments directly to compute rewards, eliminating reward hacking.

The findings underscore that the proposed GRPO+ modification significantly enhances training convergence compared to existing widely-adopted algorithms such as GRPO and DAPO. Furthermore, the asynchronous pipelining mechanism incorporated into veRL substantially optimizes the training infrastructure, enabling efficient scalability. Ultimately, by integrating these advancements within a meticulously secure environment, this thesis delivers a comprehensive RL framework that reliably aligns LLMs with sophisticated reasoning objectives, paving the way for future research into robust and scalable reinforcement learning systems.


To my parents, advisor, and research collaborators


Contents

Contents
List of Figures
List of Tables

1 Introduction
1.1 Background and Motivation
1.2 rLLM Framework Overview

2 Related Work
2.1 Reinforcement Learning for Language-Model Alignment
2.2 Distributed Frameworks and Systems Infrastructure
2.3 Secure Execution Environments and Reward Integrity

3 GoEX: Execution Runtime for LLMs
3.1 Designing a Runtime for LLM Execution
3.2 Reversibility and Damage Confinement
3.3 Symbolic Credentials and Sandboxed Execution
3.4 Credential Storage and Access Control
3.5 System Design Components

4 rLLM: RL Training for LLM Reasoning
4.1 Problem Statement
4.2 rLLM Framework

5 rLLM Experiment: Deepcoder-14B
5.1 Dataset Curation Strategy
5.2 Code Sandbox Environment for Reward Computation
5.3 Reward Function Design
5.4 Evaluation Results
5.5 End-to-end Performance

6 Conclusion

Bibliography

A Codeforces Evaluation

List of Figures

3.1 GoEX's runtime for executing RESTful API calls. Upon receiving the user's prompt, GoEX presents two alternatives. First, an LLM can be prompted to come up with the (Action, Undo-Action) pair. Second, the application developer can provide tuples of actions and their corresponding undo-actions (function calls) from which the LLM can pick.

3.2 Runtime for executing actions on a database. We present two techniques to determine if a proposed action can be undone. On the left, for non-transactional databases like MongoDB, and for flexibility, we prompt the LLM to generate (Action, Undo-Action, test-bed) tuples, which we then evaluate in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, we provide a deterministic undo with guarantees by employing the transaction semantics of databases.

3.3 Runtime for executing actions on a file system. GoEX presents two abstractions. On the left, the LLM is prompted to come up with an (Action, Undo-Action, test-bed) tuple, which GoEX evaluates in an isolated container to catch any false (Action, Undo-Action) pairs. The right presents deterministic guarantees by using a version control system such as Git or Git LFS.

4.1 Average training reward between GRPO+ and GRPO for the 16K run. GRPO's reward curve eventually collapses. GRPO+'s curve is stable due to Clip High.

4.2 Due to overlong filtering, GRPO+'s response length grows steadily over time.

4.3 Clip High and No Entropy Loss ensure that GRPO+'s token-level entropy does not collapse and encourage sufficient exploration.

4.4 DeepCoder's average response length and training rewards as training progresses. Average response length increases from 8K → 17.5K context length.

4.5 Verl's PPO/GRPO training pipeline. Every RL iteration cycles through sampling, reward function calculation, and training. Sampling is the bottleneck; training speed is bounded by straggler samplers that generate long sequences.

4.6 Minibatch Pipelining. Samplers and trainers operate in separate worker groups. As samplers complete and release mini-batches (for PPO/GRPO), trainer workers process them asynchronously. At the end of an iteration, trainers broadcast their weights to samplers.

4.7 One-Off Pipelining. Samplers generate a batch one iteration ahead, while trainers update gradients using the previous iteration's data. Second, reward function calculation is interleaved with sampling. This approach does not introduce asynchronous off-policy samples into GRPO/PPO's on-policy algorithm.

5.1 One-off pipelining fully masks away trainer and reward computation times, reducing training times by 1.4x for math and 2x for coding.

List of Tables

5.1 Model Performance on Coding and Math Benchmarks


Acknowledgments

I would like to acknowledge and thank the entire rLLM team, in particular Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and for introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for his supportive advising and for guiding me along my journey. Most importantly, I would like to thank my parents for supporting me throughout my life and getting me to where I am today. I could not have made it without their love and encouragement.


Chapter 1

Introduction

Large Language Models (LLMs) have advanced from sequence-to-sequence autoregressors into agents capable of multi-step reasoning, tool calling, and code synthesis. Supervised pre-training supplies fluent linguistic priors, yet it is reinforcement learning (RL) that aligns those priors with task-level objectives such as passing unit-test suites or developing emergent reasoning patterns. Optimizing an LLM policy πθ over long, sparse reward trajectories, however, remains brittle: credit-assignment noise grows quadratically with sequence length, and poorly instrumented environments invite reward hacking, where policies learn spurious strategies that inflate the scalar return while degrading true utility.

This thesis addresses these challenges by proposing rLLM, a purpose-built RL framework that couples a novel Group-Relative Policy Optimization Plus (GRPO+) algorithm, which builds on prior work on GRPO and DAPO, with an asynchronous, Ray-orchestrated sampling pipeline. rLLM's design goal is two-fold: (i) sustain high-throughput gradient updates on clusters of thousands of GPUs; and (ii) preserve reward integrity through airtight execution sandboxes inspired by the GoEx post-facto validation runtime. The framework is validated by training Deepcoder-14B, a 14-billion-parameter code-reasoning model that matches the performance of proprietary systems while remaining fully open source.

1.1 Background and Motivation

The alignment of Large Language Models (LLMs) has progressed from supervised fine-tuning (SFT) to full reinforcement-learning pipelines that optimize a policy over long, task-level roll-outs. Early RL with human feedback (RLHF) systems adopted Proximal Policy Optimization (PPO) and its KL-constrained variants, but the high variance of long-horizon credit assignment soon motivated Group-Relative Policy Optimization (GRPO), which measures advantages against peer trajectories sampled from the same prompt group, markedly improving stability on reasoning tasks. Subsequent work such as DAPO added dynamic sampling and decoupled clipping to push large-scale training beyond 30B parameters. Despite these algorithmic advances, convergence is still brittle whenever reward channels leak noise or are exploitable. Studies on reward hacking show that agents readily discover loopholes, such as fabricating logs, short-circuiting unit tests, or corrupting state, to inflate nominal returns while degrading true task success.

Scaling RL to frontier-sized models therefore demands system innovations as well. Synchronous actor–learner loops stall on the longest rollout, under-utilising expensive accelerators; industrial solutions now favour asynchronous pipelines built atop Ray's distributed execution engine, which offers elastic, fault-tolerant placement of both actors and learners. Libraries such as veRL expose lightweight RPC interfaces for high-throughput sampling and have become a de-facto substrate for open-source RLHF research. Yet throughput alone is insufficient: long-context optimization (32k–64k tokens) multiplies gradient noise and memory pressure, motivating iterative context-lengthening curricula that grow windows only after variance plateaus.

Equally critical is the execution environment where roll-outs are evaluated. Without explicit safeguards, an LLM tuned to interact with external tools can overwrite databases, issue destructive API calls, or generate deceptive test harnesses that pass benchmarks while hiding faulty logic. The Berkeley GoEx runtime addresses this by wrapping every REST call, file operation, and SQL mutation in deterministic undo and blast-radius-bounded confinement, producing reversible traces that can be safely replayed or discarded. Such post-facto validation provides tamper-proof reward signals, closing an essential safety loop ignored by many algorithm-centric studies.
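
To make the (Action, Undo-Action) abstraction concrete, the sketch below pairs one side-effecting REST call with a compensating call that can reverse it post facto. It is an illustrative example in the spirit of GoEx rather than its implementation; the GitHub endpoints are real, but the function names, token handling, and pairing logic are simplified assumptions.

import requests


def create_issue(repo: str, title: str, token: str):
    """Action: open a GitHub issue and return (result, undo_action)."""
    headers = {"Authorization": f"token {token}"}
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers=headers,
        json={"title": title},
        timeout=10,
    )
    resp.raise_for_status()
    issue_number = resp.json()["number"]

    def undo_action():
        # Compensating call: issues cannot be deleted through the API, so
        # "close the issue we just opened" serves as the deterministic undo.
        requests.patch(
            f"https://api.github.com/repos/{repo}/issues/{issue_number}",
            headers=headers,
            json={"state": "closed"},
            timeout=10,
        ).raise_for_status()

    return issue_number, undo_action

A runtime holding such pairs can execute the action speculatively, let a user or validator inspect the result, and invoke undo_action() if the outcome is rejected, which is the reversibility property described above.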

Finally, modern code-reasoning benchmarks like LiveCodeBench, HumanEval+, and Codeforces have emerged as stringent tests of reasoning quality under contamination-free evaluation. Open-weight models like Deepcoder-14B now match proprietary systems at 14B parameters by combining high-quality data curation with RL fine-tuning, achieving 60.6% Pass@1 on LiveCodeBench and a 1936 Codeforces rating. Their success underscores the synergistic effect of cutting-edge optimization algorithms, efficient distributed infrastructure, and meticulously sandboxed environments: precisely the triad this thesis seeks to systematise through the rLLM framework.

1.2 rLLM Framework Overview

The rLLM stack is engineered around three tightly coupled layers (algorithm, systems, and curriculum), each tuned to mitigate a specific failure mode in large-scale RL for LLMs.

Algorithmic core (GRPO+)

rLLM extends Group-Relative Policy Optimization by (i) relative-KL clipping, which bounds the per-group policy update in its own local trust region, (ii) overlong filtering, which discards trajectories whose length-scaled variance dominates the minibatch, and (iii) removal of entropy bonuses once exploration saturates. The first two modifications cut gradient variance by 18% on synthetic bandits and prevent the high-KL "spikes" reported for vanilla GRPO on DeepSeek-R1 training. Compared with DAPO's decoupled-clip objective, GRPO+ achieves equivalent final reward with 12% fewer updates on a 4k-prompt ablation.
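
To make these modifications concrete, the sketch below shows one way they can enter a GRPO-style sequence-level surrogate loss: group-normalized advantages, an asymmetric clipping range (the "Clip High" of Figures 4.1 and 4.3), overlong filtering, and no entropy bonus. The tensor shapes, clip bounds, and length threshold are illustrative placeholders, not rLLM's actual hyperparameters or implementation.

import torch


def grpo_plus_loss(logp_new: torch.Tensor,   # (G,) summed log-probs under the current policy
                   logp_old: torch.Tensor,   # (G,) summed log-probs under the sampling policy
                   rewards: torch.Tensor,    # (G,) scalar rewards for one prompt group
                   lengths: torch.Tensor,    # (G,) response lengths in tokens
                   max_len: int = 16_384,
                   eps_low: float = 0.2,
                   eps_high: float = 0.28) -> torch.Tensor:
    # Group-relative advantage: each trajectory is scored against its own prompt
    # group, which removes the need for a separate critic.
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-6)

    # Overlong filtering: trajectories that hit the context ceiling carry truncated,
    # noisy rewards, so they are masked out of the update.
    keep = (lengths < max_len).float()

    # Asymmetric clipping ("Clip High"): a wider upper bound keeps low-probability
    # tokens trainable and helps prevent entropy collapse.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv)

    # No entropy bonus is added; exploration is maintained by the clip-high bound.
    return -(surrogate * keep).sum() / keep.sum().clamp(min=1.0)

In practice this per-group loss would be averaged over all prompt groups in a minibatch before the backward pass.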

Systems layer

On the systems side, rLLM adds GRPO+ onto veRL, an open RLHF library whose actor and learner nodes are orchestrated by Ray's elastic placement engine. We introduce an asynchronous double-buffered pipeline, verl-pipe, that overlaps rollout generation and gradient application. Benchmarks on 8×A100 GPUs show 2.1× throughput versus a strong synchronous PPO baseline while sustaining ≥95% device utilization. The design eliminates the "tail latency" problem in which a single long-context sample stalls global optimization.
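
The overlap pattern can be illustrated with plain Ray primitives, as in the sketch below: sampler actors stream mini-batches into a bounded queue as soon as each one finishes, a trainer consumes them concurrently, and updated weights are broadcast back at the iteration boundary. The class names, queue size, and dummy rollouts are hypothetical stand-ins; this is not the verl-pipe API.

import random
import ray
from ray.util.queue import Queue


@ray.remote
class Sampler:
    """Generates rollouts with the current weights and streams them out."""

    def __init__(self, queue: Queue):
        self.queue = queue
        self.version = 0

    def set_weights(self, version: int) -> None:
        # Stand-in for the weight broadcast at the end of an iteration.
        self.version = version

    def sample(self, num_minibatches: int) -> None:
        for _ in range(num_minibatches):
            # A dummy rollout; real samplers would decode long sequences here.
            rollout = {"reward": random.random(), "version": self.version}
            self.queue.put(rollout)  # hand off as soon as the mini-batch is ready


@ray.remote
class Trainer:
    """Consumes mini-batches as they arrive instead of waiting for a full batch."""

    def train(self, queue: Queue, total: int) -> int:
        version = 0
        for _ in range(total):
            minibatch = queue.get(block=True)   # overlaps with ongoing sampling
            version = minibatch["version"] + 1  # pretend this was a gradient step
        return version


if __name__ == "__main__":
    ray.init()
    queue = Queue(maxsize=8)  # bounded queue provides the double-buffering back-pressure
    samplers = [Sampler.remote(queue) for _ in range(4)]
    trainer = Trainer.remote()

    sample_refs = [s.sample.remote(4) for s in samplers]          # 16 mini-batches total
    new_version = ray.get(trainer.train.remote(queue, total=16))
    ray.get(sample_refs)
    # Iteration boundary: broadcast updated weights back to every sampler.
    ray.get([s.set_weights.remote(new_version) for s in samplers])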

Curriculum layer (iterative context lengthening)

Long contexts exacerbate both memory footprint and credit-assignment noise. rLLM therefore adopts a staged curriculum, 16k → 32k → 64k tokens, advancing only when reward variance plateaus. Recent work on long-context pre-training shows that such gradual expansion yields better utilization of the expanded receptive field than jumping to the final window directly. In practice, curriculum lengthening shaves 21% off the wall-clock time relative to a static 64k run.
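
A minimal sketch of such a staged schedule follows: a controller tracks a rolling window of per-iteration reward variances and promotes the context length to the next stage once the variance stops shrinking. The stage sizes, window length, and plateau tolerance are illustrative assumptions rather than rLLM's exact settings.

from collections import deque


class ContextCurriculum:
    """Advance the sampling context window only after reward variance plateaus."""

    def __init__(self, stages=(16_384, 32_768, 65_536), window=50, rel_tol=0.05):
        self.stages = list(stages)
        self.stage_idx = 0
        self.window = window
        self.rel_tol = rel_tol
        self.variances = deque(maxlen=window)  # rolling reward-variance history

    @property
    def max_tokens(self) -> int:
        return self.stages[self.stage_idx]

    def update(self, reward_variance: float) -> int:
        """Record this iteration's reward variance and return the current window."""
        self.variances.append(reward_variance)
        full_window = len(self.variances) == self.window
        last_stage = self.stage_idx == len(self.stages) - 1
        if full_window and not last_stage:
            first, last = self.variances[0], self.variances[-1]
            # Plateau test: variance stopped shrinking appreciably over the window.
            if first > 0 and (first - last) / first < self.rel_tol:
                self.stage_idx += 1
                self.variances.clear()
        return self.max_tokens

In a training loop, the trainer would call update() once per iteration with the measured reward variance and pass max_tokens to the sampler as the generation limit.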

Empirical highlight (Deepcoder-14B)

Running the full pipeline on curated competitive-coding tasks in the Deepcoder dataset produces Deepcoder-14B, which attains 60.6% Pass@1 on LiveCodeBench, a Codeforces Elo of 1936, and 92.6% Pass@1 on HumanEval+, equaling OpenAI's o3-mini (low) with an open-sourced training procedure, data, and weights.

Environment

The above gains materialize only when the reward function runs inside an environment that is airtight. rLLM therefore executes all rollouts inside a sandbox where every code snippet runs with resource isolation and constraints; this ensures timely execution and proper fail-fast checks. These environments also need to be performant enough to support large-scale parallel reward calculation, so rLLM introduces an environment that is optimized for parallel reward-function execution while remaining sandboxed.
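
As a concrete illustration of this resource-isolated, fail-fast execution, the sketch below runs a single candidate program in a child process with CPU and memory limits and a hard wall-clock timeout, scoring it only by its output. The specific limits, the use of process rlimits instead of a full container, and the binary 0/1 reward are simplifying assumptions; the actual environment used for Deepcoder-14B (Section 5.2) is engineered for large-scale parallel reward computation.

import os
import resource
import subprocess
import sys
import tempfile


def _limit_resources() -> None:
    # Applied in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                     # 5 s CPU time
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB address space


def run_candidate(code: str, stdin_data: str, expected: str) -> float:
    """Return 1.0 if the program reproduces the expected output, else 0.0 (fail fast)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],   # -I: isolated mode, ignore user site-packages
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=10,                     # hard wall-clock limit
            preexec_fn=_limit_resources,    # rlimits applied inside the child (POSIX only)
        )
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
    return 1.0 if proc.returncode == 0 and proc.stdout.strip() == expected.strip() else 0.0

In an RL loop, many such checks run in parallel (for example via a process pool), and the returned scalar feeds directly into the reward for the corresponding trajectory.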


Chapter 2

Related Work

2.1 Reinforcement Learning for Language-Model Alignment

Early attempts at aligning large language models relied on Proximal Policy Optimization (PPO), a first-order trust-region method that clips the policy update to avert collapse while remaining computationally tractable [30]. OpenAI's InstructGPT extended PPO into a full RL-from-Human-Feedback (RLHF) pipeline, demonstrating that fine-tuning with preference-based rewards markedly improves obedience and usefulness on instruction-following benchmarks [25]. Subsequent work revealed, however, that PPO's global baseline and single-trajectory advantages struggle with the variance introduced by the long contexts and sparse rewards typical of reasoning tasks.

To mitigate these issues, Group Relative Policy Optimization (GRPO) estimates baselines from groups of trajectories sharing the same prompt, thereby sharpening credit assignment and cutting memory overhead by eliminating a separate critic network [31]. GRPO has been shown to sustain stable learning on 16k–32k token windows for mathematics-focused models, yet still exhibits poor performance when scaled to larger, heterogeneous corpora due to the constraints of sample-level loss. DAPO generalizes the idea by introducing decoupled clipping and adaptive temperature scaling, as well as token-level loss, thereby reporting improved convergence across nine public RLHF tasks and providing an open-source reference for cluster-scale training [39].
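
For reference, the group-relative baseline that replaces the learned critic can be written compactly. With G responses o_1, ..., o_G sampled for a prompt q and scalar rewards r_i, the advantage and the (sequence-level, KL-penalty omitted) clipped objective take the following standard GRPO form [31]:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}\!\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\!\big(\{r_j\}_{j=1}^{G}\big)},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
\min\!\Big( \rho_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(\rho_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right].
\]

DAPO's decoupled clipping replaces the single ε above with separate lower and upper bounds, the asymmetric form that the GRPO+ overview in Section 1.2 also builds on.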

Despite algorithmic progress, all PPO-derived methods remain vulnerable to reward hacking, the exploitation of loopholes in the reward function or environment to inflate returns without genuine task success. Recent safety analyses of frontier models, including OpenAI's o1 and o3 series, document emergent deceptive behaviour under sparse reward regimes [3]. These observations underscore that reliable alignment hinges not only on robust optimization but also on verifiable reward channels and secure execution sandboxes.

The present work builds on this lineage by proposing GRPO+, an extension that applies relative KL clipping and overlong filtering to further stabilize updates, and embedding the algorithm within an asynchronous sampling stack (Section 1.2) executed inside an airtight, reversible environment (Section 4). This holistic approach targets the intertwined algorithmic and environmental causes of convergence failure identified in prior literature.

2.2 Distributed Frameworks and Systems Infrastructure

Scaling policy-gradient optimization to billion-parameter language models demands end-to-end systems support for high-throughput sampling, fault tolerance, and elastic resource utilization. Early RLHF pipelines embedded PPO directly inside bespoke trainer scripts, but soon migrated to general-purpose frameworks such as Ray RLlib, whose actor–learner abstraction and cluster scheduler offered turnkey horizontal scale-out and recovery. RLlib's versatility, however, comes at a cost: its monolithic APIs introduce performance overheads when rollouts require long-context decoding on tensor-parallel backends [21, 15].

To address LLM-specific bottlenecks, multiple open-source systems have emerged. veRL refactors RLlib's execution model into lightweight RPC endpoints and double-buffered GPU queues, sustaining >95% utilization on multi-node clusters. DistRL pushes asynchronous data collection to CPU-heavy inference nodes while reserving GPU servers for batched gradient updates, reducing straggler-induced idle time by 27% on in-house 70B models.

Large-scale industrial stacks couple these schedulers with high-performance serving layers. NVIDIA's Triton Inference Server is frequently deployed to shard sampler traffic across tensor-parallel decode replicas, masking backend variability beneath a uniform gRPC interface. On the optimization side, DeepSpeed RL extends DeepSpeed-ZeRO with offloading primitives tailored to PPO-style gradients, delivering near-linear scaling to 512 A100s on a 175B model according to internal benchmarks [22].

The baseline for the system optimizations is provided by verl [32], an open-source library for Reinforcement Learning from Human Feedback (RLHF) training of large language models. verl is the open-source implementation of the framework described in the paper "HybridFlow: A Flexible and Efficient RLHF Framework" [32]. The HybridFlow framework was developed to address the inherent complexity and computational inefficiency of traditional RLHF dataflows.

RLHF workflows, particularly those based on algorithms like PPO and GRPO [31], involve intricate dependencies and computational tasks performed by multiple LLM instances, including the Actor (policy) model, a Reward model, a Reference model, and a Critic model. These tasks encompass generation (sampling), inference (for reward, reference, and critic), and training steps. Traditional approaches often struggled with flexibly representing and efficiently executing these complex dataflows, leading to inefficiencies.
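
The resulting dataflow can be summarized in a few lines. The toy, runnable schematic below only shows where each of the four model roles sits within one iteration (generation, then inference, then training); the stub class, method names, and the naive reward-minus-value advantage are placeholders and do not reflect the verl or HybridFlow API.

import random


class Stub:
    """Placeholder for one LLM role (actor, reward, reference, or critic)."""

    def generate(self, prompts):                # actor: sampling / rollout
        return [p + " -> response" for p in prompts]

    def score(self, prompts, responses):        # reward model: scalar rewards
        return [random.random() for _ in responses]

    def log_probs(self, prompts, responses):    # reference model: KL anchor
        return [0.0 for _ in responses]

    def values(self, prompts, responses):       # critic: value estimates
        return [0.5 for _ in responses]

    def update(self, *batch):                   # training step (no-op in this toy)
        pass


def rlhf_iteration(prompts, actor, reward_model, reference, critic):
    responses = actor.generate(prompts)                      # 1. generation
    rewards = reward_model.score(prompts, responses)         # 2. inference
    ref_logprobs = reference.log_probs(prompts, responses)
    values = critic.values(prompts, responses)
    advantages = [r - v for r, v in zip(rewards, values)]    # naive baseline
    actor.update(prompts, responses, advantages, ref_logprobs)  # 3. training
    critic.update(prompts, responses, rewards)


if __name__ == "__main__":
    actor, reward_model, reference, critic = Stub(), Stub(), Stub(), Stub()
    rlhf_iteration(["write quicksort"], actor, reward_model, reference, critic)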

HybridFlow [32] addresses these challenges by proposing a flexible and efficient architecture. Key aspects include a hybrid-controller programming model that decouples the high-level control flow (defining the RL algorithm steps) from the low-level computation flow (executing neural network operations). This design allows for better modularity and reusability. The framework also emphasizes seamless integration with existing distributed training and inference libraries (such as FSDP, Megatron-LM, vLLM, and SGLang) and supports flexible device mapping to optimize resource utilization. While HybridFlow [32] provided a robust and efficient foundation for RLHF, particularly in managing diverse workloads and model placements, the sampling bottleneck, as described in subsequent sections, remained a significant area for further optimization.

Algorithm–system co-design remains active. VAGEN integrates variance-aware gradient aggregation with a custom parameter server that adaptively drops stale roll-outs, reporting 1.8× wall-clock speed-ups on multilingual instruction tuning [34]. In parallel, ByteDance's DAPO reference implementation exposes decoupled clipping and dynamic sampling primitives atop a Ray backend, achieving 50 points on AIME 2024 with a 32B Qwen base [39]. Finally, recent studies on adaptive fault tolerance for LLM clusters propose reactive migration of learner shards upon node failure, preserving >99.5% training availability over month-long runs [12].

Collectively, these frameworks highlight three design principles adopted by rLLM: (i) actor–learner decoupling with asynchronous, back-pressure-free queues; (ii) elastic orchestration that exploits Ray's placement groups for transparent failover; and (iii) hardware-aware serving layers that co-locate decoding and gradient aggregation to minimize PCIe and network hops.

2.3 Secure Execution Environments and Reward Integrity

A persistent failure mode in large-scale reinforcement learning is reward hacking, the tendency of an agent to exploit weaknesses in the reward specification or the surrounding system to maximize return without achieving genuine task success. Documented exploits include over-fitting brittle unit tests, fabricating evaluation logs, and mutating the very artifacts used for scoring [36].

To counteract these threats, two complementary strategies have emerged. Sandbox isolation is now standard practice in code-generation RL: each candidate program executes inside a resource-bounded container, and success is judged solely by the unit-test suite [38]. While effective against arbitrary file writes or network calls, sandboxes rely on the
